From: Mark Kettenis
Subject: Re: uvm_purge()
To: Martin Pieuchot
Cc: claudio@openbsd.org, tech@openbsd.org
Date: Thu, 15 May 2025 14:45:56 +0200

> Date: Thu, 15 May 2025 12:34:16 +0200
> From: Martin Pieuchot
> 
> On 15/05/25(Thu) 11:52, Mark Kettenis wrote:
> > > Date: Wed, 14 May 2025 11:55:32 +0200
> > > From: Martin Pieuchot
> > 
> > Hi Martin,
> > 
> > Sorry, I'm a bit slow.  Have concerts coming up, so my OpenBSD time
> > is a bit limited at the moment.
> 
> No worries, thanks for your answer!
> 
> > [...]
> > > > It is important that when we flush the TLB, none of the threads
> > > > in a process have the userland page tables active.  On arm64 the
> > > > CPUs can speculatively load TLB entries even if you don't
> > > > reference the pages!  The current code deactivates the page
> > > > tables in cpu_exit() and uses atomics to make sure that the last
> > > > thread that goes through cpu_exit() also flushes the TLB.  At
> > > > that point none of the threads can sleep, so we can simply set
> > > > the TTBR0_EL1 register to point at a page filled with zeroes and
> > > > don't have to worry about a context switch resetting TTBR0_EL1
> > > > to point at the userland page tables again.  (We re-use the page
> > > > filled with zeroes from the kernel pmap for that.)
> > > > 
> > > > But uvm_purge() can sleep, so it needs to be called much
> > > > earlier.  We can prevent a context switch from reloading
> > > > TTBR0_EL1 by also setting pm->pm_pt0pa to point at that page
> > > > filled with zeroes.  But if we do that for any non-main thread,
> > > > we run into problems because another thread that is still
> > > > running userland code might context switch and end up in an
> > > > endless loop, faulting because it has a page table without valid
> > > > entries in it.
> > > > 
> > > > So that's why my new pmap_exit() function gets called in
> > > > different places for the main thread and other threads.
> > > > The main thread calls pmap_exit() at the place where your diff
> > > > calls uvm_purge(), so it could be rolled into that function.
> > > > 
> > > > I think this strategy will work for other pmaps as well, but I
> > > > probably should look at one or two other ones.
> > > 
> > > uvm_purge() is executed by the last thread in a process.  When
> > > this happens the other threads might still be at the end of
> > > exit1(), but none of them will go back to userland.
> > > 
> > > I have other diffs to improve the synchronization between the
> > > siblings of a process when exiting, mostly to remove unnecessary
> > > context switches.  They are built on the current assumption that
> > > uvm_purge() is called when all other threads have cleaned their
> > > states.  This part of my work removes the notion of 'main thread'
> > > and the P_THREAD flag.  Instead the last thread of a process to
> > > enter exit1() will clean the per-process states.
> > > 
> > > Could you use those two pieces of information to simplify your
> > > diff?
> > 
> > This suggests that I really should have two separate functions: one
> > which gets called for each thread exit (which disables the userland
> > page tables) and one that gets called from uvm_purge() (which does
> > the TLB flush and can clean up the pmap in the future).  That way I
> > don't have to rely on P_THREAD to determine what to do.
> 
> I agree.
> 
> > The latter function should probably be called pmap_purge() and it
> > is fine if we call it for the "last thread in the process" instead
> > of what we currently consider the "main thread".  But this function
> > still needs to make sure it runs after the other threads have
> > disabled their userland page tables.  And as you point out, at the
> > point where uvm_purge() gets called the other threads might still
> > be at the tail end of exit1().
> > 
> > I'm a bit hesitant to add yet another "public" pmap interface, so
> > it would be nice to have cpu_exit() handle the MD-specifics of
> > disabling the userland page tables.  I could put back the atomics
> > to manipulate pm_active and keep that code in pmap_deactivate().
> > The ordering issue between calling uvm_purge() and the other
> > threads going through cpu_exit() is largely theoretical, as there
> > isn't much code between the wakeup(&pr->ps_threads) and the
> > cpu_exit() call.  And I could make pmap_purge() block until
> > pm_active becomes 1 to make sure the other threads have gone
> > through cpu_exit().
> 
> I agree being able to use pmap_deactivate() is cleaner.
> 
> I'd like to avoid adding another barrier in exit1().  I'm actually
> working hard to remove them as much as possible to reduce existing
> latency.

The added "barrier" in pmap_purge() would be something like:

	while (pm->pm_active != 1)
		CPU_BUSY_CYCLE();

And I don't think we'd execute that loop under normal conditions.  We
could put that in as a temporary measure until the rest of this is
figured out to make sure arm64 isn't blocking your progress.  We can
replace the while loop with a KASSERT when the time comes to do that.
Alternatively we could skip the pmap_purge() optimization if the other
threads haven't gone through cpu_exit() yet.

> I can send my other diff with the signaling to wake up the last
> thread after cpu_exit().  This will require some plumbing to move
> sched_exit() out of cpu_exit().

Yes, I think moving the sched_exit() out of cpu_exit() and into
exit1() would make sense.  I wonder if we should try to move
pmap_deactivate() into exit1() as well?  Almost all cpu_exit()
implementations call it just before sched_exit().

Anyway, diff below is something I'm happier with.
Index: arch/arm64/arm64/pmap.c
===================================================================
RCS file: /cvs/src/sys/arch/arm64/arm64/pmap.c,v
diff -u -p -r1.111 pmap.c
--- arch/arm64/arm64/pmap.c	14 Feb 2025 18:36:04 -0000	1.111
+++ arch/arm64/arm64/pmap.c	15 May 2025 12:33:21 -0000
@@ -1497,14 +1497,33 @@ pmap_deactivate(struct proc *p)
 
 	KASSERT(p == curproc);
 
+	if (pm->pm_active == 0)
+		return;
+
 	WRITE_SPECIALREG(ttbr0_el1, pmap_kernel()->pm_pt0pa);
 	__asm volatile("isb");
 
-	if (atomic_dec_int_nv(&pm->pm_active) > 0)
-		return;
+	atomic_dec_int(&pm->pm_active);
+}
+
+void
+pmap_purge(struct proc *p)
+{
+	pmap_t pm = p->p_vmspace->vm_map.pmap;
+
+	KASSERT(p->p_p->ps_threadcnt == 0);
+	KASSERT(p == curproc);
+
+	while (pm->pm_active != 1)
+		CPU_BUSY_CYCLE();
+
+	WRITE_SPECIALREG(ttbr0_el1, pmap_kernel()->pm_pt0pa);
+	__asm volatile("isb");
 
 	cpu_tlb_flush_asid_all((uint64_t)pm->pm_asid << 48);
 	cpu_tlb_flush_asid_all((uint64_t)(pm->pm_asid | ASID_USER) << 48);
+
+	pm->pm_pt0pa = pmap_kernel()->pm_pt0pa;
+	pm->pm_active = 0;
 }
 
 /*
Index: arch/arm64/include/pmap.h
===================================================================
RCS file: /cvs/src/sys/arch/arm64/include/pmap.h,v
diff -u -p -r1.28 pmap.h
--- arch/arm64/include/pmap.h	3 Feb 2025 17:59:40 -0000	1.28
+++ arch/arm64/include/pmap.h	15 May 2025 12:33:21 -0000
@@ -124,6 +124,7 @@ int pmap_fault_fixup(pmap_t, vaddr_t, vm
 
 #define __HAVE_PMAP_MPSAFE_ENTER_COW
 #define __HAVE_PMAP_POPULATE
+#define __HAVE_PMAP_PURGE
 
 #endif /* _KERNEL && !_LOCORE */
Index: kern/kern_exit.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_exit.c,v
diff -u -p -r1.245 kern_exit.c
--- kern/kern_exit.c	2 May 2025 05:04:38 -0000	1.245
+++ kern/kern_exit.c	15 May 2025 12:33:23 -0000
@@ -241,6 +241,10 @@ exit1(struct proc *p, int xexit, int xsi
 		 */
 		if (pr->ps_pptr->ps_sigacts->ps_sigflags & SAS_NOCLDWAIT)
 			atomic_setbits_int(&pr->ps_flags, PS_NOZOMBIE);
+
+#ifdef __HAVE_PMAP_PURGE
+		pmap_purge(p);
+#endif
 	}
 
 	p->p_fd = NULL;		/* zap the thread's copy */
Index: uvm/uvm_pmap.h
===================================================================
RCS file: /cvs/src/sys/uvm/uvm_pmap.h,v
diff -u -p -r1.35 uvm_pmap.h
--- uvm/uvm_pmap.h	18 Jan 2025 16:35:31 -0000	1.35
+++ uvm/uvm_pmap.h	15 May 2025 12:33:23 -0000
@@ -183,6 +183,10 @@ void	pmap_virtual_space(vaddr_t *, vad
 void	pmap_populate(pmap_t, vaddr_t);
 #endif
 
+#if defined(__HAVE_PMAP_PURGE)
+void	pmap_purge(struct proc *);
+#endif
+
 /* nested pmaps are used in i386/amd64 vmm */
 #ifndef pmap_nested
 #define pmap_nested(pm)	0