From: Mark Kettenis
Subject: Re: uvm_purge()
To: Martin Pieuchot
Cc: claudio@openbsd.org, tech@openbsd.org
Date: Thu, 15 May 2025 14:45:56 +0200

> Date: Thu, 15 May 2025 12:34:16 +0200
> From: Martin Pieuchot
> 
> On 15/05/25(Thu) 11:52, Mark Kettenis wrote:
> > > Date: Wed, 14 May 2025 11:55:32 +0200
> > > From: Martin Pieuchot
> > 
> > Hi Martin,
> > 
> > Sorry, I'm a bit slow.  Have concerts coming up, so my OpenBSD time
> > is a bit limited at the moment.
> 
> No worries, thanks for your answer!
> 
> > [...]
> > > > It is important that when we flush the TLB, none of the threads
> > > > in a process have the userland page tables active.  On arm64 the
> > > > CPUs can speculatively load TLB entries even if you don't
> > > > reference the pages!  The current code deactivates the page
> > > > tables in cpu_exit() and uses atomics to make sure that the last
> > > > thread that goes through cpu_exit() also flushes the TLB.  At
> > > > that point none of the threads can sleep, so we can simply set
> > > > the TTBR0_EL1 register to point at a page filled with zeroes and
> > > > don't have to worry about a context switch resetting TTBR0_EL1
> > > > to point at the userland page tables again.  (We re-use the page
> > > > filled with zeroes from the kernel pmap for that.)
> > > > 
> > > > But uvm_purge() can sleep, so it needs to be called much
> > > > earlier.  We can prevent a context switch from reloading
> > > > TTBR0_EL1 by also setting pm->pm_pt0pa to point at that page
> > > > filled with zeroes.  But if we do that for any non-main thread,
> > > > we run into problems because another thread that is still
> > > > running userland code might context switch and end up in an
> > > > endless loop, faulting because it has a page table without valid
> > > > entries in it.
> > > > 
> > > > So that's why my new pmap_exit() function gets called in
> > > > different places for the main thread and other threads.
> > > > The main thread calls pmap_exit() at the place where your diff
> > > > calls uvm_purge(), so it could be rolled into that function.
> > > > 
> > > > I think this strategy will work for other pmaps as well, but I
> > > > probably should look at one or two other ones.
> > > 
> > > uvm_purge() is executed by the last thread in a process.  When
> > > this happens the other threads might still be at the end of
> > > exit1(), but none of them will go back to userland.
> > > 
> > > I have other diffs to improve the synchronization between the
> > > siblings of a process when exiting, mostly to remove unnecessary
> > > context switches.  They are built on the current assumption that
> > > uvm_purge() is called when all other threads have cleaned their
> > > states.  This part of my work removes the notion of 'main thread'
> > > and the P_THREAD flag.  Instead the last thread of a process to
> > > enter exit1() will clean the per-process states.
> > > 
> > > Could you use those two pieces of information to simplify your
> > > diff?
> > 
> > This suggests that I really should have two separate functions: one
> > which gets called for each thread exit (which disables the userland
> > page tables) and one that gets called from uvm_purge() (which does
> > the TLB flush and can clean up the pmap in the future).  That way I
> > don't have to rely on P_THREAD to determine what to do.
> 
> I agree.
> 
> > The latter function should probably be called pmap_purge() and it
> > is fine if we call it for the "last thread in the process" instead
> > of what we currently consider the "main thread".  But this function
> > still needs to make sure it runs after the other threads have
> > disabled their userland page tables.  And as you point out, at the
> > point where uvm_purge() gets called the other threads might still
> > be at the tail end of exit1().
> > 
> > I'm a bit hesitant to add yet another "public" pmap interface, so
> > it would be nice to have cpu_exit() handle the MD-specifics of
> > disabling the userland page tables.  I could put back the atomics
> > to manipulate pm_active and keep that code in pmap_deactivate().
> > The ordering issue between calling uvm_purge() and the other
> > threads going through cpu_exit() is largely theoretical, as there
> > isn't much code between the wakeup(&pr->ps_threads) and the
> > cpu_exit() call.  And I could make pmap_purge() block until
> > pm_active becomes 1 to make sure the other threads have gone
> > through cpu_exit().
> 
> I agree being able to use pmap_deactivate() is cleaner.
> 
> I'd like to avoid adding another barrier in exit1().  I'm actually
> working hard to remove them as much as possible to reduce existing
> latency.

The added "barrier" in pmap_purge() would be something like:

	while (pm->pm_active != 1)
		CPU_BUSY_CYCLE();

And I don't think we'd execute that loop under normal conditions.  We
could put that in as a temporary measure until the rest of this is
figured out to make sure arm64 isn't blocking your progress.  We can
replace the while loop with a KASSERT when the time comes to do that.
Alternatively we could skip the pmap_purge() optimization if the other
threads haven't gone through cpu_exit() yet.

> I can send my other diff with the signaling to wake up the last
> thread after cpu_exit().  This will require some plumbing to move
> sched_exit() out of cpu_exit().

Yes, I think moving the sched_exit() out of cpu_exit() and into
exit1() would make sense.  I wonder if we should try to move
pmap_deactivate() into exit1() as well?  Almost all cpu_exit()
implementations call it just before sched_exit().

Anyway, diff below is something I'm happier with.
Index: arch/arm64/arm64/pmap.c
===================================================================
RCS file: /cvs/src/sys/arch/arm64/arm64/pmap.c,v
diff -u -p -r1.111 pmap.c
--- arch/arm64/arm64/pmap.c	14 Feb 2025 18:36:04 -0000	1.111
+++ arch/arm64/arm64/pmap.c	15 May 2025 12:33:21 -0000
@@ -1497,14 +1497,33 @@ pmap_deactivate(struct proc *p)
 
 	KASSERT(p == curproc);
 
+	if (pm->pm_active == 0)
+		return;
+
 	WRITE_SPECIALREG(ttbr0_el1, pmap_kernel()->pm_pt0pa);
 	__asm volatile("isb");
 
-	if (atomic_dec_int_nv(&pm->pm_active) > 0)
-		return;
+	atomic_dec_int(&pm->pm_active);
+}
+
+void
+pmap_purge(struct proc *p)
+{
+	pmap_t pm = p->p_vmspace->vm_map.pmap;
+
+	KASSERT(p->p_p->ps_threadcnt == 0);
+	KASSERT(p == curproc);
+
+	while (pm->pm_active != 1)
+		CPU_BUSY_CYCLE();
+
+	WRITE_SPECIALREG(ttbr0_el1, pmap_kernel()->pm_pt0pa);
+	__asm volatile("isb");
 
 	cpu_tlb_flush_asid_all((uint64_t)pm->pm_asid << 48);
 	cpu_tlb_flush_asid_all((uint64_t)(pm->pm_asid | ASID_USER) << 48);
+
+	pm->pm_pt0pa = pmap_kernel()->pm_pt0pa;
+	pm->pm_active = 0;
 }
 
 /*
Index: arch/arm64/include/pmap.h
===================================================================
RCS file: /cvs/src/sys/arch/arm64/include/pmap.h,v
diff -u -p -r1.28 pmap.h
--- arch/arm64/include/pmap.h	3 Feb 2025 17:59:40 -0000	1.28
+++ arch/arm64/include/pmap.h	15 May 2025 12:33:21 -0000
@@ -124,6 +124,7 @@ int pmap_fault_fixup(pmap_t, vaddr_t, vm
 
 #define __HAVE_PMAP_MPSAFE_ENTER_COW
 #define __HAVE_PMAP_POPULATE
+#define __HAVE_PMAP_PURGE
 
 #endif /* _KERNEL && !_LOCORE */
Index: kern/kern_exit.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_exit.c,v
diff -u -p -r1.245 kern_exit.c
--- kern/kern_exit.c	2 May 2025 05:04:38 -0000	1.245
+++ kern/kern_exit.c	15 May 2025 12:33:23 -0000
@@ -241,6 +241,10 @@ exit1(struct proc *p, int xexit, int xsi
 		 */
 		if (pr->ps_pptr->ps_sigacts->ps_sigflags & SAS_NOCLDWAIT)
 			atomic_setbits_int(&pr->ps_flags, PS_NOZOMBIE);
+
+#ifdef __HAVE_PMAP_PURGE
+		pmap_purge(p);
+#endif
 	}
 
 	p->p_fd = NULL;		/* zap the thread's copy */
Index: uvm/uvm_pmap.h
===================================================================
RCS file: /cvs/src/sys/uvm/uvm_pmap.h,v
diff -u -p -r1.35 uvm_pmap.h
--- uvm/uvm_pmap.h	18 Jan 2025 16:35:31 -0000	1.35
+++ uvm/uvm_pmap.h	15 May 2025 12:33:23 -0000
@@ -183,6 +183,10 @@ void	pmap_virtual_space(vaddr_t *, vad
 void	pmap_populate(pmap_t, vaddr_t);
 #endif
 
+#if defined(__HAVE_PMAP_PURGE)
+void	pmap_purge(struct proc *);
+#endif
+
 /* nested pmaps are used in i386/amd64 vmm */
 #ifndef pmap_nested
 #define pmap_nested(pm)	0