From: Mark Kettenis Subject: Re: Faster _exit(2) for a faster userland: R.I.P the reaper To: Mark Kettenis Cc: cjeker@diehard.n-r-g.com, tedu@tedunangst.com, dlg@openbsd.org, visa@openbsd.org, tech@openbsd.org, jan@openbsd.org, ajacoutot@openbsd.org Date: Mon, 12 May 2025 20:12:50 +0200 > Date: Mon, 12 May 2025 17:42:57 +0200 > From: Mark Kettenis > > Just a heads-up that I got a panic running this diff. So it probably > needs more work. Adding some missing locking seems to help. > > > Date: Wed, 7 May 2025 11:11:35 +0200 > > > From: Claudio Jeker > > > > I'm not sure we should let ourselves be distracted by optimizing pmap > > teardown. But I did explore things a bit further (on arm64) and some > > observations are relevant to the ordering of operations in exit1(). > > > > > > > > On Tue, May 06, 2025 at 02:52:11PM +0200, Mark Kettenis wrote: > > > > > Date: Tue, 6 May 2025 10:46:01 +0200 > > > > > From: Claudio Jeker > > > > > > > > > > On Mon, May 05, 2025 at 10:46:38PM +0200, Mark Kettenis wrote: > > > > > > > Date: Mon, 5 May 2025 17:25:02 +0200 > > > > > > > From: Claudio Jeker > > > > > > > > > > > > > > On Sun, May 04, 2025 at 08:38:06AM -0600, Theo de Raadt wrote: > > > > > > > > Martin Pieuchot wrote: > > > > > > > > > > > > > > > > > > I do not understand any of the diff. It seems to be trying to return a > > > > > > > > > > child to the parent, before resources are cleared. > > > > > > > > > > > > > > > > > > It doesn't. Order of operations didn't change. The KERNEL_LOCK() is > > > > > > > > > there and asserted. > > > > > > > > > > > > > > > > I did not mention the KERNEL_LOCK(). > > > > > > > > > > > > > > > > > I disagree. I need to remove the reaper. That *is* the problem. > > > > > > > > > > > > > > > > You are using proc0 in an very dangerous way. > > > > > > > > > > > > > > > > > > I've called this uvmspace_purge(). It would be a bit like > > > > > > > > > > > > > > > > > > > > uvm_unmap_remove(p->p_vmspace, 0, VM_MAXUSER_ADDRESS, > > > > > > > > > > &dead_entries, FALSE, TRUE, TRUE) > > > > > > > > > > > > > > > > > > > > Except there are clearly optimizations. The function can sleep. It is > > > > > > > > > > called quite early in process context. > > > > > > > > > > [...] > > > > > > > > > > > > > > > > > > There's no need to add a new function. Calling uvm_exit() from exit1() > > > > > > > > > is enough and work but that's not buying us much. > > > > > > > > > > > > > > > > As proc0? Disagree strongly on this direction. The process should free > > > > > > > > it's own pages out of it's own map. It can do that if uvm_exit() is split. > > > > > > > > > > > > > > > > But you've made up your mind to ignore everything everyone says. You don't > > > > > > > > even reply to emails about the proposals. Your public requests for feedback > > > > > > > > come off as fake. > > > > > > > > > > > > > > Here is a minimal diff implementing uvm_purge() in the dumbest way > > > > > > > possible. It removes the reaper from the hot path without any scary hacks. > > > > > > > > > > > > > > It only fixes uvm_map_teardown() so it can be called again. > > > > > > > > > > > > But we shouldn't have to call it again no? > > > > > > > > > > There is a complication that has to do with VM space sharing (which is > > > > > only used by proc0). So it is easier for now to offload proc0 handling to > > > > > uvm_exit() (which does the refcnt update and check if it hit 0). > > > > > > > > > > I think we can kill VM space sharing and use real threads for kthread. > > > > > This would remove the vm space refcnt and make code a fair bit simpler. > > > > > guenther@ wanted that as well so maybe I look into that. > > > > > > > > > > > > I tried to zap the pmap first (with pmap_remove of the full user address > > > > > > > space) but that is very slow. > > > > > > > > > > > > Our page tables are typically only populated sparsely. So this > > > > > > doesn't surprise me. > > > > > > > > > > Yes, it was a try, I was hoping that pmap_remove would be walk the sparse > > > > > pmap tables efficently. It does not. > > > > > > > > > > > > I think pmap can be optimized a lot here but we don't have an api > > > > > > > for that yet. > > > > > > > > > > > > What optimizations are you thinking off? The only "easy" optimization > > > > > > I found in the teardown was replacing the TLB flushes with a single > > > > > > ASID-based TLB flush on arm64. > > > > > > > > > > So there are a few things I noticed. uvm_map_teardown() spends about the > > > > > same amount of time in uvm as in pmap. It calls pmap many many times so > > > > > there is a lot of call overhead. > > > > > > > > On arm64 more than 50% of the time spent in pmap during teardown was > > > > spent doing the individual TLB flushes. > > > > > > > > > The moment uvm purges the userland mappings a split KU arch (arm64, > > > > > sparc64, ...) can nuke the pmap, flush out the tlbs via ASID and then > > > > > instruct uvm_map_teardown() to skip any pmap calls. > > > > > > > > Only sparc64 is completely split KU; arm64 is somewhere in the middle, > > > > with a single address space that is divided in two halves with > > > > separate page tables covering each half. > > > > > > > > But even on these architectures you can't simply skip any pmap calls. > > > > The problem here is pmap_page_protect() and the associated management > > > > of the PV lists. If you simply skip pmap_remove() calls, the PV lists > > > > will contain dangling pointers. > > > > > > It should be possible to do the pmap_remove() bits upfront. E.g. call > > > pmap_remove_pv() which takes the pte_desc off the PV list. > > > In the end what we want is a function that cleans out all U entries from > > > the pmap in the most efficent way. Right now pmap_remove is called a lot > > > and there is a fair bit of overhead (e.g. pmap_vp_lookup of every page in > > > the range). > > > > So the diff below adds a missing pmap_remove_pv() call into > > pmap_vp_destroy_l2_l3() and then moves the call to pmap_vp_destroy() > > that was in pmap_release() into pmap_deactivate(). That'll tear down > > the pmaps and V->P data structures as soon as the last thread in a > > process calls cpu_exit(). > > > > I added short-circuit checks in pmap_remove() and pmap_protect() but > > those didn't result in measurable performance changes in my kernel > > build benchmarking. > > > > In fact the only measurable change for the diff is an increase in > > system time attributed to the build. We go from: > > > > 1m01.85s real 5m24.67s user 2m36.90s system > > > > to: > > > > 1m01.89s real 5m23.25s user 2m53.81s system > > > > That increase in system time makes sense since we're now doing some of > > the work of the reaper when the process exits. > > > > I'm curious to see what happens on your 80-core beast. > > > > > > > Still doing this will still allow uvm_map_teardown() to skip most pmap > > > > > calls. > > > > > > > > As indicated above, this isn't so simple. We can only really skip > > > > pmap_remove() calls for pages that are completely private to the uvm > > > > space that we're tearing down. But I'm not even sure if we can easily > > > > tell whether a page is private or not based on the uvm data > > > > structures. > > > > > > The idea is that if we do the pmap work upfront then we can shortcut > > > pmap_remove() in uvm_map_teardown() since the work is already done. > > > If you mmap 4GB of mem but only touch 7 pages in that range then > > > pmap_remove will do 1+ million pmap_vp_lookup() calls for 7 hits. > > > I think this is something that can be optimized. > > > > That is a bit of a silly thing to do. But I guess doing silly things > > is the norm these days. I don't think pmap_vp_lookup() is terribly > > expensive, especially if you do a lookup of sequential addresses as > > the relevant parts of the data structures will be cached. At least my > > benchmarks don't point into that direction. But since we have to walk > > the entire page table anyway to tear down the page directories we > > might as well optimize things and clear out everything that did end up > > being mapped. > > > > So I think the order in which things need to be done is: > > > > 1. Make sure all threads in the process have switched away from the > > userland page tables. > > > > 2. Flush all TLB entries for the process. > > > > 3. Tear down the pmap. > > > > 4. Tear down the vmspace. > > > > With my diff, step 4 still happens in the reaper. Moving it into > > exit1() needs a little bit of care since step 1-3 currently happen in > > cpu_exit() which is the last thing that exit1() does... Index: arch/arm64/arm64/pmap.c =================================================================== RCS file: /cvs/src/sys/arch/arm64/arm64/pmap.c,v diff -u -p -r1.111 pmap.c --- arch/arm64/arm64/pmap.c 14 Feb 2025 18:36:04 -0000 1.111 +++ arch/arm64/arm64/pmap.c 12 May 2025 18:11:39 -0000 @@ -695,6 +695,9 @@ pmap_remove(pmap_t pm, vaddr_t sva, vadd struct pte_desc *pted; vaddr_t va; + if (!pm->pm_active) + return; + pmap_lock(pm); for (va = sva; va < eva; va += PAGE_SIZE) { pted = pmap_vp_lookup(pm, va, NULL); @@ -1007,7 +1010,6 @@ pmap_destroy(pmap_t pm) void pmap_release(pmap_t pm) { - pmap_vp_destroy(pm); } void @@ -1035,7 +1037,10 @@ pmap_vp_destroy_l2_l3(pmap_t pm, struct if (pted == NULL) continue; vp3->vp[l] = NULL; - + + if (PTED_MANAGED(pted)) + pmap_remove_pv(pted); + pool_put(&pmap_pted_pool, pted); } pool_put(&pmap_vp_pool, vp3); @@ -1505,6 +1510,10 @@ pmap_deactivate(struct proc *p) cpu_tlb_flush_asid_all((uint64_t)pm->pm_asid << 48); cpu_tlb_flush_asid_all((uint64_t)(pm->pm_asid | ASID_USER) << 48); + + pmap_lock(pm); + pmap_vp_destroy(pm); + pmap_unlock(pm); } /* @@ -1634,6 +1643,9 @@ pmap_page_protect(struct vm_page *pg, vm void pmap_protect(pmap_t pm, vaddr_t sva, vaddr_t eva, vm_prot_t prot) { + if (!pm->pm_active) + return; + if (prot & (PROT_READ | PROT_EXEC)) { pmap_lock(pm); while (sva < eva) {