From: Mark Kettenis <mark.kettenis@xs4all.nl>
Subject: Re: sparc64 pmap (was: Re: Please test: parallel fault handling)
To: Jeremie Courreges-Anglas <jca@wxcvbn.org>
Cc: tech@openbsd.org, miod@online.fr
Date: Tue, 27 May 2025 15:26:29 +0200

> Date: Mon, 26 May 2025 13:38:39 +0200
> From: Jeremie Courreges-Anglas <jca@wxcvbn.org>

Hi Jeremie,

> On Mon, May 26, 2025 at 09:23:02AM +0200, Jeremie Courreges-Anglas wrote:
> > On Mon, May 26, 2025 at 12:04:59AM +0200, Mark Kettenis wrote:
> > > 
> > > Can you try the diff below?  I've nuked pmap_collect() on other
> > > architectures since it is hard to make "mpsafe".  The current sparc64
> > > certainly isn't.
> > > 
> > > If this fixes the crashes, I can see if it is possible to make it
> > > "mpsafe" on sparc64.
> > 
> > I see.  Indeed this function stood out when I looked at the file.  Can
> > I really just empty it out like this, or should I also drop
> > __HAVE_PMAP_COLLECT?
> > 
> > Yesterday the machine crashed again in a nearby place in
> > pmap_page_protect().

That is still without my pmap_collect() diff, isn't it?

> > login: pmap_page_protect: gotten pseg empty!
> > Stopped at      pmap_page_protect+0x620:        nop
> > ddb{4}> tr
> > uvm_pagedeactivate(4000d1510a0, 19d2ce0, 193edd0, 40015a507e4, 1, 1) at uvm_pagedeactivate+0x54
> > uvn_flush(401053a3310, 0, 11c8000, 14, 1, 0) at uvn_flush+0x448
> > uvn_detach(401053a3310, 40101633630, 1, 0, 0, 1000000) at uvn_detach+0x158
> > uvm_unmap_detach(400fe6b5c68, 0, 9a7, 40015a507e4, 40015a507e4, 18f7828) at uvm_unmap_detach+0x68
> > uvm_map_teardown(400130912e0, 4000, 1939240, 4010599c000, 54d, 17b2ff8) at uvm_map_teardown+0x184
> > uvmspace_free(1cd7000, 400130912e0, 4, 17b2ff8, 0, 6) at uvmspace_free+0x64
> > reaper(40015a507e0, 40015a507e0, db, 100, 1c04038, 4000) at reaper+0x100
> > proc_trampoline(0, 0, 0, 0, 0, 0) at proc_trampoline+0x10
> > ddb{4}>
> > 
> > This time it's pmap.c l.2534:
> > 
> >                 /* Then remove the primary pv */
> >                 if (pv->pv_pmap != NULL) {
> >                         data = pseg_get(pv->pv_pmap, pv->pv_va & PV_VAMASK);
> > 
> >                         /* Save REF/MOD info */
> >                         pv->pv_va |= pmap_tte2flags(data);
> >                         if (pseg_set(pv->pv_pmap, pv->pv_va & PV_VAMASK, 0, 0)) {
> >                                 printf("pmap_page_protect: gotten pseg empty!\n");
> > -->                             db_enter();
> >                                 /* panic? */
> >                         }
> > 
> > It's not the first crash with uvmspace_free() going through
> > uvm_pagedeactivate(), so I agree with you that a fix is probably
> > needed for mpi's uvm_purge() diff (which I haven't tested yet on that
> > box).
> 
> FWIW pmap_page_protect() itself looks like a potential problem, as you
> suspected.  The diff below locks the pmaps in pmap_page_protect(PROT_NONE),
> which is used by uvm_pagedeactivate().

I don't rule out issues in pmap_page_protect() yet.  However, this
could still be another manifestation of the issue with pmap_collect().

> Walking the pv list and locking the pmap introduces a lock ordering
> constraint, something that the arm64 and riscv64 implems handle with a
> slightly convoluted approach.  Surely sparc64 should have its
> pseg_get/set calls protected by the pmap lock too.  Now, sparc64 is
> special in several regards: each mapping must be checked for mod/ref
> information by accessing psegs, pv list handling is inlined, and the
> first element in the pv list is embedded in struct mdpage.
> 
> The diff below drops the current pv entry from the list before acting
> on it, thus avoiding concurrent access and the "start all over again"
> idiom.  For the "first" pv entry, we can make a copy of it so that we
> can drop the pv lock.  It could be argued that this static pv
> header/entry is making things more complex than they need to be.
> 
> The code keeps the mod/ref accounting.  But does it make sense to keep
> the mod/ref information at all, if we're flushing/destroying all the
> mappings to a page?  If it is indeed pointless, which I'm uncertain
> about, we could drop the pseg_get/mtx_enter/etc chunks.  The
> performance of the diff below is similar to before applying it (with
> the uvm parallel fault diff applied).
> 
> In this very function there's the "lower protection" code path that
> walks the pv list and accesses psegs.  Not sure yet how to tackle it,
> but maybe we don't need to handle it yet?
> 
> Thanks to miod for shedding some light on this pmap implementation
> over the week-end (he hasn't seen the diff yet).
> 
> Input welcome.

The pseg_get() and pseg_set() functions use atomic instructions to
manipulate the page tables, so they shouldn't need the pmap lock.  But
we can't remove pages from the page tables (which is what
pmap_collect() does), because then these functions may read data from a
recycled page (or even modify it).  Note that this isn't just a
problem for pseg_get() and pseg_set(), but also for the "inline
pseg_get()" that is part of low-level fault handlers.

But there is a potential race with the REF/MOD bits that may need some
attention.  I'll have another look at this diff soon.
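
Concretely, the window I'm thinking of is between reading the old TTE
and clearing it (an illustration of the interleaving only, not a
proposed change):

/*
 * Possible lost-update window in pmap_page_protect():
 *
 *	CPU1:	data = pseg_get(pm, va);	snapshot of REF/MOD bits
 *	CPU0:	writes to the page; the fault path sets the modified
 *		bit in the same TTE
 *	CPU1:	pv->pv_va |= pmap_tte2flags(data);
 *						stale snapshot, the new
 *						modified bit is not seen
 *	CPU1:	pseg_set(pm, va, 0, 0);		clears the TTE together
 *						with the fresh modified bit
 */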

> Index: pmap.c
> ===================================================================
> RCS file: /cvs/src/sys/arch/sparc64/sparc64/pmap.c,v
> diff -u -p -r1.121 pmap.c
> --- pmap.c	26 Jun 2024 01:40:49 -0000	1.121
> +++ pmap.c	26 May 2025 08:26:39 -0000
> @@ -2489,6 +2489,7 @@ pmap_page_protect(struct vm_page *pg, vm
>  		mtx_leave(&pg->mdpage.pvmtx);
>  	} else {
>  		pv_entry_t firstpv;
> +		struct pv_entry pve;
>  		/* remove mappings */
>  
>  		firstpv = pa_to_pvh(pa);
> @@ -2496,10 +2497,19 @@ pmap_page_protect(struct vm_page *pg, vm
>  
>  		/* First remove the entire list of continuation pv's*/
>  		while ((pv = firstpv->pv_next) != NULL) {
> -			data = pseg_get(pv->pv_pmap, pv->pv_va & PV_VAMASK);
> +			/* Remove item from the list, avoids concurrent calls */
> +			firstpv->pv_next = pv->pv_next;
> +			pv->pv_next = NULL;
> +
> +			/* Unlock the pv lock to avoid ordering problems */
> +			mtx_leave(&pg->mdpage.pvmtx);
> +			mtx_enter(&pv->pv_pmap->pm_mtx);
>  
>  			/* Save REF/MOD info */
> +			data = pseg_get(pv->pv_pmap, pv->pv_va & PV_VAMASK);
> +			mtx_enter(&pg->mdpage.pvmtx);
>  			firstpv->pv_va |= pmap_tte2flags(data);
> +			mtx_leave(&pg->mdpage.pvmtx);
>  
>  			/* Clear mapping */
>  			if (pseg_set(pv->pv_pmap, pv->pv_va & PV_VAMASK, 0, 0)) {
> @@ -2507,6 +2517,8 @@ pmap_page_protect(struct vm_page *pg, vm
>  				db_enter();
>  				/* panic? */
>  			}
> +			mtx_leave(&pv->pv_pmap->pm_mtx);
> +
>  			if (pv->pv_pmap->pm_ctx || pv->pv_pmap == pmap_kernel()) {
>  				tsb_invalidate(pv->pv_pmap->pm_ctx,
>  				    (pv->pv_va & PV_VAMASK));
> @@ -2515,25 +2527,38 @@ pmap_page_protect(struct vm_page *pg, vm
>  			atomic_dec_long(&pv->pv_pmap->pm_stats.resident_count);
>  
>  			/* free the pv */
> -			firstpv->pv_next = pv->pv_next;
> -			mtx_leave(&pg->mdpage.pvmtx);
>  			pool_put(&pv_pool, pv);
>  			mtx_enter(&pg->mdpage.pvmtx);
>  		}
>  
> -		pv = firstpv;
> +		/*
> +		 * Make a local copy of the primary pv data to guard
> +		 * against concurrent calls.
> +		 */
> +		pv = NULL;
> +		if (firstpv->pv_pmap != NULL) {
> +			pve = *firstpv;
> +			pv = &pve;
> +			firstpv->pv_pmap = NULL;
> +		}
> +		mtx_leave(&pg->mdpage.pvmtx);
> +
> +		if (pv != NULL) {
> +			mtx_enter(&pv->pv_pmap->pm_mtx);
>  
> -		/* Then remove the primary pv */
> -		if (pv->pv_pmap != NULL) {
> +			/* Save REF/MOD info */
>  			data = pseg_get(pv->pv_pmap, pv->pv_va & PV_VAMASK);
> +			mtx_enter(&pg->mdpage.pvmtx);
> +			firstpv->pv_va |= pmap_tte2flags(data);
> +			mtx_leave(&pg->mdpage.pvmtx);
>  
> -			/* Save REF/MOD info */
> -			pv->pv_va |= pmap_tte2flags(data);
> +			/* Clear mapping */
>  			if (pseg_set(pv->pv_pmap, pv->pv_va & PV_VAMASK, 0, 0)) {
>  				printf("pmap_page_protect: gotten pseg empty!\n");
>  				db_enter();
>  				/* panic? */
>  			}
> +			mtx_leave(&pv->pv_pmap->pm_mtx);
>  			if (pv->pv_pmap->pm_ctx || pv->pv_pmap == pmap_kernel()) {
>  				tsb_invalidate(pv->pv_pmap->pm_ctx,
>  				    (pv->pv_va & PV_VAMASK));
> @@ -2541,13 +2566,9 @@ pmap_page_protect(struct vm_page *pg, vm
>  				    pv->pv_pmap->pm_ctx);
>  			}
>  			atomic_dec_long(&pv->pv_pmap->pm_stats.resident_count);
> -
> -			KASSERT(pv->pv_next == NULL);
> -			/* dump the first pv */
> -			pv->pv_pmap = NULL;
>  		}
> +
>  		dcache_flush_page(pa);
> -		mtx_leave(&pg->mdpage.pvmtx);
>  	}
>  	/* We should really only flush the pages we demapped. */
>  }
> 
> -- 
> jca
>