From: Mark Kettenis
Subject: Re: Faster _exit(2) for a faster userland: R.I.P the reaper
To: Claudio Jeker
Cc: tedu@tedunangst.com, dlg@openbsd.org, visa@openbsd.org, tech@openbsd.org, jan@openbsd.org, ajacoutot@openbsd.org
Date: Sun, 11 May 2025 23:27:38 +0200

> Date: Wed, 7 May 2025 11:11:35 +0200
> From: Claudio Jeker

I'm not sure we should let ourselves be distracted by optimizing pmap
teardown. But I did explore things a bit further (on arm64) and some
observations are relevant to the ordering of operations in exit1().

> > On Tue, May 06, 2025 at 02:52:11PM +0200, Mark Kettenis wrote:
> > > Date: Tue, 6 May 2025 10:46:01 +0200
> > > From: Claudio Jeker
> > > 
> > > On Mon, May 05, 2025 at 10:46:38PM +0200, Mark Kettenis wrote:
> > > > > Date: Mon, 5 May 2025 17:25:02 +0200
> > > > > From: Claudio Jeker
> > > > > 
> > > > > On Sun, May 04, 2025 at 08:38:06AM -0600, Theo de Raadt wrote:
> > > > > > Martin Pieuchot wrote:
> > > > > > 
> > > > > > > > I do not understand any of the diff. It seems to be trying to return a
> > > > > > > > child to the parent, before resources are cleared.
> > > > > > > 
> > > > > > > It doesn't. Order of operations didn't change. The KERNEL_LOCK() is
> > > > > > > there and asserted.
> > > > > > 
> > > > > > I did not mention the KERNEL_LOCK().
> > > > > > 
> > > > > > > I disagree. I need to remove the reaper. That *is* the problem.
> > > > > > 
> > > > > > You are using proc0 in a very dangerous way.
> > > > > > 
> > > > > > > > I've called this uvmspace_purge(). It would be a bit like
> > > > > > > > 
> > > > > > > > 	uvm_unmap_remove(p->p_vmspace, 0, VM_MAXUSER_ADDRESS,
> > > > > > > > 	    &dead_entries, FALSE, TRUE, TRUE)
> > > > > > > > 
> > > > > > > > Except there are clearly optimizations. The function can sleep. It is
> > > > > > > > called quite early in process context.
> > > > > > > > [...]
> > > > > > > 
> > > > > > > There's no need to add a new function. Calling uvm_exit() from exit1()
> > > > > > > is enough and works, but that's not buying us much.
> > > > > > 
> > > > > > As proc0? Disagree strongly on this direction. The process should free
> > > > > > its own pages out of its own map. It can do that if uvm_exit() is split.
> > > > > > 
> > > > > > But you've made up your mind to ignore everything everyone says. You don't
> > > > > > even reply to emails about the proposals. Your public requests for feedback
> > > > > > come off as fake.
> > > > > 
> > > > > Here is a minimal diff implementing uvm_purge() in the dumbest way
> > > > > possible. It removes the reaper from the hot path without any scary hacks.
> > > > > 
> > > > > It only fixes uvm_map_teardown() so it can be called again.
> > > > 
> > > > But we shouldn't have to call it again, no?
> > > 
> > > There is a complication that has to do with VM space sharing (which is
> > > only used by proc0). So it is easier for now to offload proc0 handling to
> > > uvm_exit() (which does the refcnt update and check if it hit 0).
> > > 
> > > I think we can kill VM space sharing and use real threads for kthread.
> > > This would remove the vm space refcnt and make the code a fair bit simpler.
> > > guenther@ wanted that as well so maybe I'll look into that.
> > > 
> > > > > I tried to zap the pmap first (with pmap_remove of the full user address
> > > > > space) but that is very slow.
> > > > 
> > > > Our page tables are typically only populated sparsely. So this
> > > > doesn't surprise me.
> > > 
> > > Yes, it was a try; I was hoping that pmap_remove would walk the sparse
> > > pmap tables efficiently. It does not.
> > > > > I think pmap can be optimized a lot here but we don't have an api
> > > > > for that yet.
> > > > 
> > > > What optimizations are you thinking of? The only "easy" optimization
> > > > I found in the teardown was replacing the TLB flushes with a single
> > > > ASID-based TLB flush on arm64.
> > > 
> > > So there are a few things I noticed. uvm_map_teardown() spends about the
> > > same amount of time in uvm as in pmap. It calls pmap many, many times, so
> > > there is a lot of call overhead.
> > 
> > On arm64 more than 50% of the time spent in pmap during teardown was
> > spent doing the individual TLB flushes.
> > 
> > > The moment uvm purges the userland mappings, a split KU arch (arm64,
> > > sparc64, ...) can nuke the pmap, flush out the TLBs via ASID and then
> > > instruct uvm_map_teardown() to skip any pmap calls.
> > 
> > Only sparc64 is completely split KU; arm64 is somewhere in the middle,
> > with a single address space that is divided in two halves with
> > separate page tables covering each half.
> > 
> > But even on these architectures you can't simply skip any pmap calls.
> > The problem here is pmap_page_protect() and the associated management
> > of the PV lists. If you simply skip pmap_remove() calls, the PV lists
> > will contain dangling pointers.
> 
> It should be possible to do the pmap_remove() bits upfront. E.g. call
> pmap_remove_pv(), which takes the pte_desc off the PV list.
> In the end what we want is a function that cleans out all U entries from
> the pmap in the most efficient way. Right now pmap_remove is called a lot
> and there is a fair bit of overhead (e.g. pmap_vp_lookup of every page in
> the range).

So the diff below adds a missing pmap_remove_pv() call into
pmap_vp_destroy_l2_l3() and then moves the call to pmap_vp_destroy()
that was in pmap_release() into pmap_deactivate(). That'll tear down
the pmaps and V->P data structures as soon as the last thread in a
process calls cpu_exit().

I added short-circuit checks in pmap_remove() and pmap_protect() but
those didn't result in measurable performance changes in my kernel
build benchmarking. In fact the only measurable change for the diff is
an increase in system time attributed to the build. We go from:

    1m01.85s real     5m24.67s user     2m36.90s system

to:

    1m01.89s real     5m23.25s user     2m53.81s system

That increase in system time makes sense since we're now doing some of
the work of the reaper when the process exits. I'm curious to see what
happens on your 80-core beast.

> > > Still, doing this will allow uvm_map_teardown() to skip most pmap
> > > calls.
> > 
> > As indicated above, this isn't so simple. We can only really skip
> > pmap_remove() calls for pages that are completely private to the uvm
> > space that we're tearing down. But I'm not even sure if we can easily
> > tell whether a page is private or not based on the uvm data
> > structures.
> 
> The idea is that if we do the pmap work upfront then we can shortcut
> pmap_remove() in uvm_map_teardown() since the work is already done.
> If you mmap 4GB of mem but only touch 7 pages in that range then
> pmap_remove will do 1+ million pmap_vp_lookup() calls for 7 hits.
> I think this is something that can be optimized.

That is a bit of a silly thing to do. But I guess doing silly things is
the norm these days.

I don't think pmap_vp_lookup() is terribly expensive, especially if you
do a lookup of sequential addresses as the relevant parts of the data
structures will be cached. At least my benchmarks don't point in that
direction.
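To put a number on that example anyway (back-of-the-envelope, assuming
the usual 4K pages on arm64):

    4 GB / 4 KB per page = 2^32 / 2^12 = 2^20 = 1,048,576 iterations

so the per-page loop in pmap_remove() takes about a million steps, and
does about a million pmap_vp_lookup() calls, to find the 7 pages that
are actually mapped. The cost is the iteration count rather than any
single lookup.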
But since we have to walk the entire page table anyway to tear down the
page directories, we might as well optimize things and clear out
everything that did end up being mapped.

So I think the order in which things need to be done is:

1. Make sure all threads in the process have switched away from the
   userland page tables.
2. Flush all TLB entries for the process.
3. Tear down the pmap.
4. Tear down the vmspace.

With my diff, step 4 still happens in the reaper. Moving it into
exit1() needs a little bit of care since steps 1-3 currently happen in
cpu_exit(), which is the last thing that exit1() does... (A rough
sketch of that ordering follows after the diff below.)

Index: arch/arm64/arm64/pmap.c
===================================================================
RCS file: /cvs/src/sys/arch/arm64/arm64/pmap.c,v
diff -u -p -r1.111 pmap.c
--- arch/arm64/arm64/pmap.c	14 Feb 2025 18:36:04 -0000	1.111
+++ arch/arm64/arm64/pmap.c	11 May 2025 20:42:38 -0000
@@ -695,6 +695,9 @@ pmap_remove(pmap_t pm, vaddr_t sva, vadd
 	struct pte_desc *pted;
 	vaddr_t va;
 
+	if (!pm->pm_active)
+		return;
+
 	pmap_lock(pm);
 	for (va = sva; va < eva; va += PAGE_SIZE) {
 		pted = pmap_vp_lookup(pm, va, NULL);
@@ -1007,7 +1010,6 @@ pmap_destroy(pmap_t pm)
 void
 pmap_release(pmap_t pm)
 {
-	pmap_vp_destroy(pm);
 }
 
 void
@@ -1035,7 +1037,10 @@ pmap_vp_destroy_l2_l3(pmap_t pm, struct
 			if (pted == NULL)
 				continue;
 			vp3->vp[l] = NULL;
-
+
+			if (PTED_MANAGED(pted))
+				pmap_remove_pv(pted);
+
 			pool_put(&pmap_pted_pool, pted);
 		}
 		pool_put(&pmap_vp_pool, vp3);
@@ -1505,6 +1510,8 @@ pmap_deactivate(struct proc *p)
 
 	cpu_tlb_flush_asid_all((uint64_t)pm->pm_asid << 48);
 	cpu_tlb_flush_asid_all((uint64_t)(pm->pm_asid | ASID_USER) << 48);
+
+	pmap_vp_destroy(pm);
 }
 
 /*
@@ -1634,6 +1641,9 @@ pmap_page_protect(struct vm_page *pg, vm
 void
 pmap_protect(pmap_t pm, vaddr_t sva, vaddr_t eva, vm_prot_t prot)
 {
+	if (!pm->pm_active)
+		return;
+
 	if (prot & (PROT_READ | PROT_EXEC)) {
 		pmap_lock(pm);
 		while (sva < eva) {
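And here is the promised sketch of how steps 1-4 could line up once
step 4 moves out of the reaper. Purely illustrative, not a working
patch: the function name is made up, locking and the multi-threaded
exit case are ignored, and with the diff above it is pmap_deactivate()
that ends up covering steps 1-3 on arm64.

/*
 * Illustrative sketch only.  Assumes it runs from the last exiting
 * thread, late in exit1().
 */
void
exit1_vm_teardown_sketch(struct proc *p)
{
	/*
	 * Steps 1-3: stop using the userland page tables, flush the
	 * process's TLB entries by ASID and tear down the pmap (VP
	 * tables and PV entries).  With the diff above this is what
	 * pmap_deactivate() does.
	 */
	pmap_deactivate(p);

	/*
	 * Step 4: tear down the vmspace.  Today this still runs in
	 * the reaper; moving it here is the part that needs care.
	 */
	uvm_map_teardown(&p->p_vmspace->vm_map);
}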