Faster _exit(2) for a faster userland: R.I.P the reaper
> Date: Mon, 12 May 2025 17:42:57 +0200
> From: Mark Kettenis <mark.kettenis@xs4all.nl>
>
> Just a heads-up that I got a panic running this diff. So it probably
> needs more work.
Adding some missing locking seems to help.
> > > Date: Wed, 7 May 2025 11:11:35 +0200
> > > From: Claudio Jeker <cjeker@diehard.n-r-g.com>
> >
> > I'm not sure we should let ourselves be distracted by optimizing pmap
> > teardown. But I did explore things a bit further (on arm64) and some
> > observations are relevant to the ordering of operations in exit1().
> >
> > >
> > > On Tue, May 06, 2025 at 02:52:11PM +0200, Mark Kettenis wrote:
> > > > > Date: Tue, 6 May 2025 10:46:01 +0200
> > > > > From: Claudio Jeker <cjeker@diehard.n-r-g.com>
> > > > >
> > > > > On Mon, May 05, 2025 at 10:46:38PM +0200, Mark Kettenis wrote:
> > > > > > > Date: Mon, 5 May 2025 17:25:02 +0200
> > > > > > > From: Claudio Jeker <cjeker@diehard.n-r-g.com>
> > > > > > >
> > > > > > > On Sun, May 04, 2025 at 08:38:06AM -0600, Theo de Raadt wrote:
> > > > > > > > Martin Pieuchot <mpi@grenadille.net> wrote:
> > > > > > > >
> > > > > > > > > > I do not understand any of the diff. It seems to be trying to return a
> > > > > > > > > > child to the parent, before resources are cleared.
> > > > > > > > >
> > > > > > > > > It doesn't. Order of operations didn't change. The KERNEL_LOCK() is
> > > > > > > > > there and asserted.
> > > > > > > >
> > > > > > > > I did not mention the KERNEL_LOCK().
> > > > > > > >
> > > > > > > > > I disagree. I need to remove the reaper. That *is* the problem.
> > > > > > > >
> > > > > > > > You are using proc0 in an very dangerous way.
> > > > > > > >
> > > > > > > > > > I've called this uvmspace_purge(). It would be a bit like
> > > > > > > > > >
> > > > > > > > > > uvm_unmap_remove(p->p_vmspace, 0, VM_MAXUSER_ADDRESS,
> > > > > > > > > > &dead_entries, FALSE, TRUE, TRUE)
> > > > > > > > > >
> > > > > > > > > > Except there are clearly optimizations. The function can sleep. It is
> > > > > > > > > > called quite early in process context.
> > > > > > > > > > [...]
> > > > > > > > >
> > > > > > > > > There's no need to add a new function. Calling uvm_exit() from exit1()
> > > > > > > > > is enough and works, but that's not buying us much.
> > > > > > > >
> > > > > > > > As proc0? Disagree strongly on this direction. The process should free
> > > > > > > > its own pages out of its own map. It can do that if uvm_exit() is split.
> > > > > > > >
> > > > > > > > But you've made up your mind to ignore everything everyone says. You don't
> > > > > > > > even reply to emails about the proposals. Your public requests for feedback
> > > > > > > > come off as fake.
> > > > > > >
> > > > > > > Here is a minimal diff implementing uvm_purge() in the dumbest way
> > > > > > > possible. It removes the reaper from the hot path without any scary hacks.
> > > > > > >
> > > > > > > It only fixes uvm_map_teardown() so it can be called again.
> > > > > >
> > > > > > But we shouldn't have to call it again no?
> > > > >
> > > > > There is a complication that has to do with VM space sharing (which is
> > > > > only used by proc0). So it is easier for now to offload proc0 handling to
> > > > > uvm_exit() (which does the refcnt update and checks whether it hit 0).
> > > > >
> > > > > I think we can kill VM space sharing and use real threads for kthreads.
> > > > > This would remove the vm space refcnt and make code a fair bit simpler.
> > > > > guenther@ wanted that as well, so maybe I'll look into that.
> > > > >
> > > > > > > I tried to zap the pmap first (with pmap_remove of the full user address
> > > > > > > space) but that is very slow.
> > > > > >
> > > > > > Our page tables are typically only populated sparsely. So this
> > > > > > doesn't surprise me.
> > > > >
> > > > > Yes, it was a try; I was hoping that pmap_remove would walk the sparse
> > > > > pmap tables efficiently. It does not.
> > > > >
> > > > > > > I think pmap can be optimized a lot here but we don't have an api
> > > > > > > for that yet.
> > > > > >
> > > > > > What optimizations are you thinking of? The only "easy" optimization
> > > > > > I found in the teardown was replacing the TLB flushes with a single
> > > > > > ASID-based TLB flush on arm64.
> > > > >
> > > > > So there are a few things I noticed. uvm_map_teardown() spends about the
> > > > > same amount of time in uvm as in pmap. It calls pmap many many times so
> > > > > there is a lot of call overhead.
> > > >
> > > > On arm64 more than 50% of the time spent in pmap during teardown was
> > > > spent doing the individual TLB flushes.
> > > >
> > > > > The moment uvm purges the userland mappings a split KU arch (arm64,
> > > > > sparc64, ...) can nuke the pmap, flush out the tlbs via ASID and then
> > > > > instruct uvm_map_teardown() to skip any pmap calls.
> > > >
> > > > Only sparc64 is completely split KU; arm64 is somewhere in the middle,
> > > > with a single address space that is divided into two halves with
> > > > separate page tables covering each half.
> > > >
> > > > But even on these architectures you can't simply skip any pmap calls.
> > > > The problem here is pmap_page_protect() and the associated management
> > > > of the PV lists. If you simply skip pmap_remove() calls, the PV lists
> > > > will contain dangling pointers.
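
(To make that PV list hazard concrete, here is a simplified sketch; the
names follow my reading of arm64 pmap.c and should be treated as
approximate rather than authoritative.  Every managed mapping has a
pte_desc linked onto its page's PV list, and pmap_page_protect() walks
that list long after the owning pmap may be gone:)

	/* Simplified: each managed mapping is tracked by a pte_desc. */
	struct pte_desc {
		LIST_ENTRY(pte_desc) pted_pv_list;	/* link on the page's PV list */
		pmap_t pted_pmap;
		vaddr_t pted_va;
	};

	/* Roughly what a later pmap_page_protect() does for a managed page: */
	mtx_enter(&pg->mdpage.pv_mtx);
	LIST_FOREACH(pted, &pg->mdpage.pv_list, pted_pv_list) {
		/*
		 * If an exiting process put this pted back into
		 * pmap_pted_pool without calling pmap_remove_pv() first,
		 * this walk chases a dangling pointer.
		 */
	}
	mtx_leave(&pg->mdpage.pv_mtx);
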
> > >
> > > It should be possible to do the pmap_remove() bits upfront. E.g. call
> > > pmap_remove_pv() which takes the pte_desc off the PV list.
> > > In the end what we want is a function that cleans out all U entries from
> > > the pmap in the most efficient way. Right now pmap_remove is called a lot
> > > and there is a fair bit of overhead (e.g. pmap_vp_lookup of every page in
> > > the range).
> >
> > So the diff below adds a missing pmap_remove_pv() call into
> > pmap_vp_destroy_l2_l3() and then moves the call to pmap_vp_destroy()
> > that was in pmap_release() into pmap_deactivate(). That'll tear down
> > the pmaps and V->P data structures as soon as the last thread in a
> > process calls cpu_exit().
> >
> > I added short-circuit checks in pmap_remove() and pmap_protect() but
> > those didn't result in measurable performance changes in my kernel
> > build benchmarking.
> >
> > In fact the only measurable change for the diff is an increase in
> > system time attributed to the build. We go from:
> >
> > 1m01.85s real 5m24.67s user 2m36.90s system
> >
> > to:
> >
> > 1m01.89s real 5m23.25s user 2m53.81s system
> >
> > That increase in system time makes sense since we're now doing some of
> > the work of the reaper when the process exits.
> >
> > I'm curious to see what happens on your 80-core beast.
> >
> > > > > Still, doing this will allow uvm_map_teardown() to skip most pmap
> > > > > calls.
> > > >
> > > > As indicated above, this isn't so simple. We can only really skip
> > > > pmap_remove() calls for pages that are completely private to the uvm
> > > > space that we're tearing down. But I'm not even sure if we can easily
> > > > tell whether a page is private or not based on the uvm data
> > > > structures.
> > >
> > > The idea is that if we do the pmap work upfront then we can shortcut
> > > pmap_remove() in uvm_map_teardown() since the work is already done.
> > > If you mmap 4GB of mem but only touch 7 pages in that range then
> > > pmap_remove will do 1+ million pmap_vp_lookup() calls for 7 hits.
> > > I think this is something that can be optimized.
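
(For scale: a 4GB range at 4K pages is 2^20 = 1048576 iterations of the
per-page loop for those 7 hits.  A rough illustration of the difference;
the first loop is essentially what pmap_remove() does today, simplified,
and the second part is only a sketch of the "purge upfront" idea, not a
proposed interface:)

	/* Today: one pmap_vp_lookup() per page in the range, mapped or not. */
	for (va = sva; va < eva; va += PAGE_SIZE) {
		pted = pmap_vp_lookup(pm, va, NULL);	/* ~1M calls, 7 hits */
		if (pted == NULL || !PTED_VALID(pted))
			continue;
		pmap_remove_pted(pm, pted);
	}

	/*
	 * Purge upfront: walk only the vp levels that were ever populated,
	 * so the cost scales with what was actually mapped rather than with
	 * the size of the range.  pmap_vp_destroy() already walks the tables
	 * that way; once it also takes the pte_descs off the PV lists (see
	 * the diff below), uvm_map_teardown() no longer needs its per-page
	 * pmap_remove() calls.
	 */
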
> >
> > That is a bit of a silly thing to do. But I guess doing silly things
> > is the norm these days. I don't think pmap_vp_lookup() is terribly
> > expensive, especially if you do a lookup of sequential addresses as
> > the relevant parts of the data structures will be cached. At least my
> > benchmarks don't point in that direction. But since we have to walk
> > the entire page table anyway to tear down the page directories we
> > might as well optimize things and clear out everything that did end up
> > being mapped.
> >
> > So I think the order in which things need to be done is:
> >
> > 1. Make sure all threads in the process have switched away from the
> > userland page tables.
> >
> > 2. Flush all TLB entries for the process.
> >
> > 3. Tear down the pmap.
> >
> > 4. Tear down the vmspace.
> >
> > With my diff, step 4 still happens in the reaper. Moving it into
> > exit1() needs a little bit of care since steps 1-3 currently happen in
> > cpu_exit() which is the last thing that exit1() does...
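
For what it's worth, here is how I read those steps against the code paths
touched by the diff below (my interpretation of the mapping, not gospel):

	/*
	 * 1. Switch away from the userland page tables: happens when the
	 *    exiting thread makes its final switch away (the cpu_exit()
	 *    path), which is also where pmap_deactivate() runs.
	 * 2. Flush the process's TLB entries: the two
	 *    cpu_tlb_flush_asid_all() calls in pmap_deactivate().
	 * 3. Tear down the pmap: pmap_vp_destroy(), which the diff moves
	 *    from pmap_release() into pmap_deactivate().
	 * 4. Tear down the vmspace: still left to the reaper
	 *    (uvm_exit() -> uvmspace_free()), unchanged by this diff.
	 */
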
Index: arch/arm64/arm64/pmap.c
===================================================================
RCS file: /cvs/src/sys/arch/arm64/arm64/pmap.c,v
diff -u -p -r1.111 pmap.c
--- arch/arm64/arm64/pmap.c 14 Feb 2025 18:36:04 -0000 1.111
+++ arch/arm64/arm64/pmap.c 12 May 2025 18:11:39 -0000
@@ -695,6 +695,9 @@ pmap_remove(pmap_t pm, vaddr_t sva, vadd
 	struct pte_desc *pted;
 	vaddr_t va;
 
+	if (!pm->pm_active)
+		return;
+
 	pmap_lock(pm);
 	for (va = sva; va < eva; va += PAGE_SIZE) {
 		pted = pmap_vp_lookup(pm, va, NULL);
@@ -1007,7 +1010,6 @@ pmap_destroy(pmap_t pm)
 void
 pmap_release(pmap_t pm)
 {
-	pmap_vp_destroy(pm);
 }
 
 void
@@ -1035,7 +1037,10 @@ pmap_vp_destroy_l2_l3(pmap_t pm, struct
 				if (pted == NULL)
 					continue;
 				vp3->vp[l] = NULL;
-
+
+				if (PTED_MANAGED(pted))
+					pmap_remove_pv(pted);
+
 				pool_put(&pmap_pted_pool, pted);
 			}
 			pool_put(&pmap_vp_pool, vp3);
@@ -1505,6 +1510,10 @@ pmap_deactivate(struct proc *p)
 
 	cpu_tlb_flush_asid_all((uint64_t)pm->pm_asid << 48);
 	cpu_tlb_flush_asid_all((uint64_t)(pm->pm_asid | ASID_USER) << 48);
+
+	pmap_lock(pm);
+	pmap_vp_destroy(pm);
+	pmap_unlock(pm);
 }
 
 /*
@@ -1634,6 +1643,9 @@ pmap_page_protect(struct vm_page *pg, vm
 void
 pmap_protect(pmap_t pm, vaddr_t sva, vaddr_t eva, vm_prot_t prot)
 {
+	if (!pm->pm_active)
+		return;
+
 	if (prot & (PROT_READ | PROT_EXEC)) {
 		pmap_lock(pm);
 		while (sva < eva) {