From: Mark Kettenis Subject: Re: Faster _exit(2) for a faster userland: R.I.P the reaper To: Mark Kettenis Cc: cjeker@diehard.n-r-g.com, tedu@tedunangst.com, dlg@openbsd.org, visa@openbsd.org, tech@openbsd.org, jan@openbsd.org, ajacoutot@openbsd.org Date: Mon, 12 May 2025 17:42:57 +0200 > Date: Sun, 11 May 2025 23:27:38 +0200 > From: Mark Kettenis Just a heads-up that I got a panic running this diff. So it probably needs more work. > > Date: Wed, 7 May 2025 11:11:35 +0200 > > From: Claudio Jeker > > I'm not sure we should let ourselves be distracted by optimizing pmap > teardown. But I did explore things a bit further (on arm64) and some > observations are relevant to the ordering of operations in exit1(). > > > > > On Tue, May 06, 2025 at 02:52:11PM +0200, Mark Kettenis wrote: > > > > Date: Tue, 6 May 2025 10:46:01 +0200 > > > > From: Claudio Jeker > > > > > > > > On Mon, May 05, 2025 at 10:46:38PM +0200, Mark Kettenis wrote: > > > > > > Date: Mon, 5 May 2025 17:25:02 +0200 > > > > > > From: Claudio Jeker > > > > > > > > > > > > On Sun, May 04, 2025 at 08:38:06AM -0600, Theo de Raadt wrote: > > > > > > > Martin Pieuchot wrote: > > > > > > > > > > > > > > > > I do not understand any of the diff. It seems to be trying to return a > > > > > > > > > child to the parent, before resources are cleared. > > > > > > > > > > > > > > > > It doesn't. Order of operations didn't change. The KERNEL_LOCK() is > > > > > > > > there and asserted. > > > > > > > > > > > > > > I did not mention the KERNEL_LOCK(). > > > > > > > > > > > > > > > I disagree. I need to remove the reaper. That *is* the problem. > > > > > > > > > > > > > > You are using proc0 in an very dangerous way. > > > > > > > > > > > > > > > > I've called this uvmspace_purge(). It would be a bit like > > > > > > > > > > > > > > > > > > uvm_unmap_remove(p->p_vmspace, 0, VM_MAXUSER_ADDRESS, > > > > > > > > > &dead_entries, FALSE, TRUE, TRUE) > > > > > > > > > > > > > > > > > > Except there are clearly optimizations. The function can sleep. It is > > > > > > > > > called quite early in process context. > > > > > > > > > [...] > > > > > > > > > > > > > > > > There's no need to add a new function. Calling uvm_exit() from exit1() > > > > > > > > is enough and work but that's not buying us much. > > > > > > > > > > > > > > As proc0? Disagree strongly on this direction. The process should free > > > > > > > it's own pages out of it's own map. It can do that if uvm_exit() is split. > > > > > > > > > > > > > > But you've made up your mind to ignore everything everyone says. You don't > > > > > > > even reply to emails about the proposals. Your public requests for feedback > > > > > > > come off as fake. > > > > > > > > > > > > Here is a minimal diff implementing uvm_purge() in the dumbest way > > > > > > possible. It removes the reaper from the hot path without any scary hacks. > > > > > > > > > > > > It only fixes uvm_map_teardown() so it can be called again. > > > > > > > > > > But we shouldn't have to call it again no? > > > > > > > > There is a complication that has to do with VM space sharing (which is > > > > only used by proc0). So it is easier for now to offload proc0 handling to > > > > uvm_exit() (which does the refcnt update and check if it hit 0). > > > > > > > > I think we can kill VM space sharing and use real threads for kthread. > > > > This would remove the vm space refcnt and make code a fair bit simpler. > > > > guenther@ wanted that as well so maybe I look into that. > > > > > > > > > > I tried to zap the pmap first (with pmap_remove of the full user address > > > > > > space) but that is very slow. > > > > > > > > > > Our page tables are typically only populated sparsely. So this > > > > > doesn't surprise me. > > > > > > > > Yes, it was a try, I was hoping that pmap_remove would be walk the sparse > > > > pmap tables efficently. It does not. > > > > > > > > > > I think pmap can be optimized a lot here but we don't have an api > > > > > > for that yet. > > > > > > > > > > What optimizations are you thinking off? The only "easy" optimization > > > > > I found in the teardown was replacing the TLB flushes with a single > > > > > ASID-based TLB flush on arm64. > > > > > > > > So there are a few things I noticed. uvm_map_teardown() spends about the > > > > same amount of time in uvm as in pmap. It calls pmap many many times so > > > > there is a lot of call overhead. > > > > > > On arm64 more than 50% of the time spent in pmap during teardown was > > > spent doing the individual TLB flushes. > > > > > > > The moment uvm purges the userland mappings a split KU arch (arm64, > > > > sparc64, ...) can nuke the pmap, flush out the tlbs via ASID and then > > > > instruct uvm_map_teardown() to skip any pmap calls. > > > > > > Only sparc64 is completely split KU; arm64 is somewhere in the middle, > > > with a single address space that is divided in two halves with > > > separate page tables covering each half. > > > > > > But even on these architectures you can't simply skip any pmap calls. > > > The problem here is pmap_page_protect() and the associated management > > > of the PV lists. If you simply skip pmap_remove() calls, the PV lists > > > will contain dangling pointers. > > > > It should be possible to do the pmap_remove() bits upfront. E.g. call > > pmap_remove_pv() which takes the pte_desc off the PV list. > > In the end what we want is a function that cleans out all U entries from > > the pmap in the most efficent way. Right now pmap_remove is called a lot > > and there is a fair bit of overhead (e.g. pmap_vp_lookup of every page in > > the range). > > So the diff below adds a missing pmap_remove_pv() call into > pmap_vp_destroy_l2_l3() and then moves the call to pmap_vp_destroy() > that was in pmap_release() into pmap_deactivate(). That'll tear down > the pmaps and V->P data structures as soon as the last thread in a > process calls cpu_exit(). > > I added short-circuit checks in pmap_remove() and pmap_protect() but > those didn't result in measurable performance changes in my kernel > build benchmarking. > > In fact the only measurable change for the diff is an increase in > system time attributed to the build. We go from: > > 1m01.85s real 5m24.67s user 2m36.90s system > > to: > > 1m01.89s real 5m23.25s user 2m53.81s system > > That increase in system time makes sense since we're now doing some of > the work of the reaper when the process exits. > > I'm curious to see what happens on your 80-core beast. > > > > > Still doing this will still allow uvm_map_teardown() to skip most pmap > > > > calls. > > > > > > As indicated above, this isn't so simple. We can only really skip > > > pmap_remove() calls for pages that are completely private to the uvm > > > space that we're tearing down. But I'm not even sure if we can easily > > > tell whether a page is private or not based on the uvm data > > > structures. > > > > The idea is that if we do the pmap work upfront then we can shortcut > > pmap_remove() in uvm_map_teardown() since the work is already done. > > If you mmap 4GB of mem but only touch 7 pages in that range then > > pmap_remove will do 1+ million pmap_vp_lookup() calls for 7 hits. > > I think this is something that can be optimized. > > That is a bit of a silly thing to do. But I guess doing silly things > is the norm these days. I don't think pmap_vp_lookup() is terribly > expensive, especially if you do a lookup of sequential addresses as > the relevant parts of the data structures will be cached. At least my > benchmarks don't point into that direction. But since we have to walk > the entire page table anyway to tear down the page directories we > might as well optimize things and clear out everything that did end up > being mapped. > > So I think the order in which things need to be done is: > > 1. Make sure all threads in the process have switched away from the > userland page tables. > > 2. Flush all TLB entries for the process. > > 3. Tear down the pmap. > > 4. Tear down the vmspace. > > With my diff, step 4 still happens in the reaper. Moving it into > exit1() needs a little bit of care since step 1-3 currently happen in > cpu_exit() which is the last thing that exit1() does... > > > Index: arch/arm64/arm64/pmap.c > =================================================================== > RCS file: /cvs/src/sys/arch/arm64/arm64/pmap.c,v > diff -u -p -r1.111 pmap.c > --- arch/arm64/arm64/pmap.c 14 Feb 2025 18:36:04 -0000 1.111 > +++ arch/arm64/arm64/pmap.c 11 May 2025 20:42:38 -0000 > @@ -695,6 +695,9 @@ pmap_remove(pmap_t pm, vaddr_t sva, vadd > struct pte_desc *pted; > vaddr_t va; > > + if (!pm->pm_active) > + return; > + > pmap_lock(pm); > for (va = sva; va < eva; va += PAGE_SIZE) { > pted = pmap_vp_lookup(pm, va, NULL); > @@ -1007,7 +1010,6 @@ pmap_destroy(pmap_t pm) > void > pmap_release(pmap_t pm) > { > - pmap_vp_destroy(pm); > } > > void > @@ -1035,7 +1037,10 @@ pmap_vp_destroy_l2_l3(pmap_t pm, struct > if (pted == NULL) > continue; > vp3->vp[l] = NULL; > - > + > + if (PTED_MANAGED(pted)) > + pmap_remove_pv(pted); > + > pool_put(&pmap_pted_pool, pted); > } > pool_put(&pmap_vp_pool, vp3); > @@ -1505,6 +1510,8 @@ pmap_deactivate(struct proc *p) > > cpu_tlb_flush_asid_all((uint64_t)pm->pm_asid << 48); > cpu_tlb_flush_asid_all((uint64_t)(pm->pm_asid | ASID_USER) << 48); > + > + pmap_vp_destroy(pm); > } > > /* > @@ -1634,6 +1641,9 @@ pmap_page_protect(struct vm_page *pg, vm > void > pmap_protect(pmap_t pm, vaddr_t sva, vaddr_t eva, vm_prot_t prot) > { > + if (!pm->pm_active) > + return; > + > if (prot & (PROT_READ | PROT_EXEC)) { > pmap_lock(pm); > while (sva < eva) { > >