From: Mark Kettenis
Subject: Re: uvm_purge()
To: Martin Pieuchot
Cc: tech@openbsd.org
Date: Thu, 15 May 2025 11:52:16 +0200

> Date: Wed, 14 May 2025 11:55:32 +0200
> From: Martin Pieuchot

Hi Martin,

Sorry, I'm a bit slow.  Have concerts coming up so my OpenBSD time is
a bit limited at the moment.

> On 13/05/25(Tue) 22:31, Mark Kettenis wrote:
> > > Date: Tue, 13 May 2025 14:21:44 +0200
> > > From: Martin Pieuchot
> > > 
> > > On 12/05/25(Mon) 22:00, Mark Kettenis wrote:
> > > > > Date: Mon, 12 May 2025 20:42:40 +0200
> > > > > From: Martin Pieuchot
> > > > 
> > > > Hi Martin,
> > > > 
> > > > > Diff below moves the tearing down of VM space to exit1().  It implies
> > > > > that processes will now be charged for cleaning their VM space.  As a
> > > > > result their %sys time will increase.
> > > > > 
> > > > > uvm_purge() is called in the "top" part of exit1() when the process is
> > > > > still allowed to sleep.  To allow this piece of code to be executed in
> > > > > parallel, all recursive kernel_lock tokens need to be released.
> > > > > 
> > > > > In the diff below uvm_purge() is called twice.  The first time in
> > > > > exit1() and the second time, doing nothing but integrity checks, in
> > > > > uvm_exit().  Another approach could be to just check that the map is
> > > > > empty.  Any preferences?
> > > > > 
> > > > > Comments?  Oks?
> > > > 
> > > > As I explained to Claudio in the "Faster _exit(2) for a faster
> > > > userland: R.I.P the reaper" thread, this will disable the TLB flushing
> > > > optimization I added on arm64 a few months ago.  The uvm_purge() call
> > > > needs to come after the userland page tables have been disabled and
> > > > the TLBs have been flushed.
> > > > 
> > > > Unfortunately the way I hooked that operation into pmap_deactivate()
> > > > makes this non-trivial.  The easiest way out may be to add a new
> > > > pmap_purge() that does this operation and that gets called from an
> > > > appropriate spot in exit1().  This must happen after the other threads of
> > > > the process have exited, so somewhere after the point where the main
> > > > thread waits for them to die.
> > > 
> > > As long as this only touches userland page tables it can be inside
> > > uvm_purge().
> > 
> > > > I can cook you a diff for that.
> > > 
> > > Thanks!
> > 
> > So here is a proof of concept; not sure I'm entirely happy with this
> > yet.  But hopefully it gets across what the challenge is.
> 
> Lovely, more inputs below that hopefully can help simplify this proof of
> concept.
> > 
> > It is important that when we flush the TLB, none of the threads in a
> > process have the userland page tables active.  On arm64 the CPUs can
> > speculatively load TLB entries even if you don't reference the pages!
> > The current code deactivates the page tables in cpu_exit() and uses
> > atomics to make sure that the last thread that goes through cpu_exit()
> > also flushes the TLB.  At that point none of the threads can sleep, so
> > we can simply set the TTBR0_EL1 register to point at a page filled
> > with zeroes and don't have to worry about a context switch resetting
> > TTBR0_EL1 to point at the userland page tables again.  (We re-use the
> > page filled with zeroes from the kernel pmap for that.)
> > 
> > But uvm_purge() can sleep, so it needs to be called much earlier.  We
> > can prevent a context switch from reloading TTBR0_EL1 by also setting
> > pm->pm_pt0pa to point at that page filled with zeroes.  But if we do
> > that for any non-main thread, we run into problems because another
> > thread that is still running userland code might context switch and
> > end up in an endless loop faulting because it has a page table without
> > valid entries in it.
> > 
> > So that's why my new pmap_exit() function gets called in different
> > places for the main thread and other threads.  The main thread calls
> > pmap_exit() at the place where your diff calls uvm_purge(), so it
> > could be rolled into that function.
> > 
> > I think this strategy will work for other pmaps as well, but I
> > probably should look at one or two other ones.
> 
> uvm_purge() is executed by the last thread in a process.  When this
> happens the other threads might still be at the end of exit1() but none
> of them will go back to userland.
> 
> I have other diffs to improve the synchronization between the siblings
> of a process when exiting, mostly to remove unnecessary context switches.
> They are built on the current assumption that uvm_purge() is called when
> all other threads have cleaned their states.
> This part of my work removes the notion of 'main thread' and the
> P_THREAD flag.  Instead the last thread of a process to enter exit1()
> will clean the per-process states.
> 
> Could you use those two pieces of information to simplify your diff?

This suggests that I really should have two separate functions, one
which gets called for each thread exit (which disables the userland
page tables) and one that gets called from uvm_purge() (which does the
TLB flush and can clean up the pmap in the future).  That way I don't
have to rely on P_THREAD to determine what to do.

The latter function should probably be called pmap_purge() and it is
fine if we call it for the "last thread in the process" instead of
what we currently consider the "main thread".  But this function still
needs to make sure it runs after the other threads have disabled their
userland page tables.  And as you point out, at the point where
uvm_purge() gets called the other threads might still be at the tail
end of exit1().

I'm a bit hesitant to add yet another "public" pmap interface, so it
would be nice to have cpu_exit() handle the MD specifics of disabling
the userland page tables.  I could put back the atomics to manipulate
pm_active and keep that code in pmap_deactivate().  The ordering issue
between calling uvm_purge() and the other threads going through
cpu_exit() is largely theoretical as there isn't much code between the
wakeup(&pr->ps_threads) and the cpu_exit() call.  And I could make
pmap_purge() block until pm_active becomes 1 to make sure the other
threads have gone through cpu_exit().
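
Untested sketch of what pmap_purge() could look like on arm64; the
polling tsleep_nsec() loop (and the "purge" wait channel name) is just
a placeholder until pmap_deactivate() gets a matching wakeup:

void
pmap_purge(struct proc *p)
{
	pmap_t pm = p->p_vmspace->vm_map.pmap;

	KASSERT(p == curproc);

	/* Stop using the userland page tables on this CPU. */
	WRITE_SPECIALREG(ttbr0_el1, pmap_kernel()->pm_pt0pa);
	__asm volatile("isb");

	/* Make sure a context switch can't re-enable them. */
	pm->pm_pt0pa = pmap_kernel()->pm_pt0pa;

	/*
	 * Block until the other threads have gone through cpu_exit()
	 * and disabled their userland page tables; only then is it
	 * safe to flush, since no CPU can speculatively load TLB
	 * entries for our ASIDs anymore.
	 */
	while (atomic_load_int(&pm->pm_active) > 1)
		tsleep_nsec(&pm->pm_active, PWAIT, "purge",
		    MSEC_TO_NSEC(1));

	cpu_tlb_flush_asid_all((uint64_t)pm->pm_asid << 48);
	cpu_tlb_flush_asid_all((uint64_t)(pm->pm_asid | ASID_USER) << 48);
}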

> > Index: arch/arm64/arm64/pmap.c
> > ===================================================================
> > RCS file: /cvs/src/sys/arch/arm64/arm64/pmap.c,v
> > diff -u -p -r1.111 pmap.c
> > --- arch/arm64/arm64/pmap.c	14 Feb 2025 18:36:04 -0000	1.111
> > +++ arch/arm64/arm64/pmap.c	13 May 2025 20:09:45 -0000
> > @@ -946,6 +946,7 @@ pmap_pinit(pmap_t pm)
> >  	pmap_extract(pmap_kernel(), l0va, (paddr_t *)&pm->pm_pt0pa);
> >  
> >  	pmap_reference(pm);
> > +	pm->pm_active = 1;
> >  }
> >  
> >  int pmap_vp_poolcache = 0; /* force vp poolcache to allocate late */
> > @@ -1482,7 +1483,6 @@ pmap_activate(struct proc *p)
> >  {
> >  	pmap_t pm = p->p_vmspace->vm_map.pmap;
> >  
> > -	atomic_inc_int(&pm->pm_active);
> >  	if (p == curproc && pm != curcpu()->ci_curpm)
> >  		pmap_setttb(p);
> >  }
> > @@ -1493,6 +1493,11 @@ pmap_activate(struct proc *p)
> >  void
> >  pmap_deactivate(struct proc *p)
> >  {
> > +}
> > +
> > +void
> > +pmap_exit(struct proc *p)
> > +{
> >  	pmap_t pm = p->p_vmspace->vm_map.pmap;
> >  
> >  	KASSERT(p == curproc);
> > @@ -1500,11 +1505,15 @@ pmap_deactivate(struct proc *p)
> >  	WRITE_SPECIALREG(ttbr0_el1, pmap_kernel()->pm_pt0pa);
> >  	__asm volatile("isb");
> >  
> > -	if (atomic_dec_int_nv(&pm->pm_active) > 0)
> > +	if ((p->p_flag & P_THREAD))
> >  		return;
> >  
> > +	KASSERT(p->p_p->ps_threadcnt == 0);
> > +
> >  	cpu_tlb_flush_asid_all((uint64_t)pm->pm_asid << 48);
> >  	cpu_tlb_flush_asid_all((uint64_t)(pm->pm_asid | ASID_USER) << 48);
> > +	pm->pm_pt0pa = pmap_kernel()->pm_pt0pa;
> > +	pm->pm_active = 0;
> >  }
> >  
> >  /*
> > Index: arch/arm64/include/pmap.h
> > ===================================================================
> > RCS file: /cvs/src/sys/arch/arm64/include/pmap.h,v
> > diff -u -p -r1.28 pmap.h
> > --- arch/arm64/include/pmap.h	3 Feb 2025 17:59:40 -0000	1.28
> > +++ arch/arm64/include/pmap.h	13 May 2025 20:09:45 -0000
> > @@ -125,6 +125,8 @@ int pmap_fault_fixup(pmap_t, vaddr_t, vm
> >  #define __HAVE_PMAP_MPSAFE_ENTER_COW
> >  #define __HAVE_PMAP_POPULATE
> >  
> > +void pmap_exit(struct proc *);
> > +
> >  #endif /* _KERNEL && !_LOCORE */
> >  
> >  #ifndef _LOCORE
> > Index: kern/kern_exit.c
> > ===================================================================
> > RCS file: /cvs/src/sys/kern/kern_exit.c,v
> > diff -u -p -r1.245 kern_exit.c
> > --- kern/kern_exit.c	2 May 2025 05:04:38 -0000	1.245
> > +++ kern/kern_exit.c	13 May 2025 20:09:48 -0000
> > @@ -241,6 +241,8 @@ exit1(struct proc *p, int xexit, int xsi
> >  		 */
> >  		if (pr->ps_pptr->ps_sigacts->ps_sigflags & SAS_NOCLDWAIT)
> >  			atomic_setbits_int(&pr->ps_flags, PS_NOZOMBIE);
> > +
> > +		pmap_exit(p);
> >  	}
> >  
> >  	p->p_fd = NULL;		/* zap the thread's copy */
> > @@ -369,6 +371,7 @@ exit1(struct proc *p, int xexit, int xsi
> >  
> >  	/* just a thread? check if last one standing. */
> >  	if (p->p_flag & P_THREAD) {
> > +		pmap_exit(p);
> >  		/* scheduler_wait_hook(pr->ps_mainproc, p); XXX */
> >  		mtx_enter(&pr->ps_mtx);
> >  		pr->ps_exitcnt--;
> > 
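
For completeness, the per-thread half of the split would then mostly
revert the pmap_deactivate() hunk above: pmap_activate() gets its
atomic_inc_int(&pm->pm_active) back, pmap_deactivate() keeps disabling
the userland page tables, but the TLB flush moves to pmap_purge().
Equally untested; the wakeup pairs with the polling sleep sketched
earlier:

void
pmap_deactivate(struct proc *p)
{
	pmap_t pm = p->p_vmspace->vm_map.pmap;

	/* Stop using the userland page tables on this CPU. */
	WRITE_SPECIALREG(ttbr0_el1, pmap_kernel()->pm_pt0pa);
	__asm volatile("isb");

	/* No TLB flush here; pmap_purge() does that once we are alone. */
	if (atomic_dec_int_nv(&pm->pm_active) == 1)
		wakeup(&pm->pm_active);
}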