From: Jeremie Courreges-Anglas <jca@wxcvbn.org>
Subject: sparc64 pmap (was: Re: Please test: parallel fault handling)
To: Mark Kettenis <mark.kettenis@xs4all.nl>, tech@openbsd.org
Cc: Miod Vallat <miod@online.fr>
Date: Mon, 26 May 2025 13:38:39 +0200

On Mon, May 26, 2025 at 09:23:02AM +0200, Jeremie Courreges-Anglas wrote:
> On Mon, May 26, 2025 at 12:04:59AM +0200, Mark Kettenis wrote:
> > > Date: Sun, 25 May 2025 23:20:46 +0200
> > > From: Jeremie Courreges-Anglas <jca@wxcvbn.org>
> > > 
> > > On Thu, May 22, 2025 at 08:19:38PM +0200, Mark Kettenis wrote:
> > > > > Date: Thu, 22 May 2025 18:54:08 +0200
> > > > > From: Jeremie Courreges-Anglas <jca@wxcvbn.org>
> > > [...]
> > > > > *Bzzzt*
> > > > > 
> > > > > The same LDOM was busy compiling two devel/llvm copies under dpb(1).
> > > > > Input welcome, I'm not sure yet what other ddb commands could help.
> > > > > 
> > > > > login: panic: trap type 0x34 (mem address not aligned): pc=1012f68 npc=1012f6c pstate=820006<PRIV,IE>
> > > > > Stopped at      db_enter+0x8:   nop
> > > > >     TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
> > > > >   57488   1522      0        0x11          0    1  perl
> > > > >  435923   9891     55   0x1000002          0    4  cc1plus
> > > > >  135860  36368     55   0x1000002          0   13  cc1plus
> > > > >  333743  96489     55   0x1000002          0    0  cc1plus
> > > > >  433162  55422     55   0x1000002          0    9  cc1plus
> > > > >  171658  49723     55   0x1000002          0    5  cc1plus
> > > > >   47127  57536     55   0x1000002          0   10  cc1plus
> > > > >   56600   9350     55   0x1000002          0   14  cc1plus
> > > > >  159792  13842     55   0x1000002          0    6  cc1plus
> > > > >  510019  10312     55   0x1000002          0    8  cc1plus
> > > > >   20489  65709     55   0x1000002          0   15  cc1plus
> > > > >  337455  42430     55   0x1000002          0   12  cc1plus
> > > > >  401407  80906     55   0x1000002          0   11  cc1plus
> > > > >   22993  62317     55   0x1000002          0    2  cc1plus
> > > > >  114916  17058     55   0x1000002          0    7  cc1plus
> > > > > *435412  33034      0     0x14000      0x200    3K pagedaemon
> > > > > trap(400fe6b19b0, 34, 1012f68, 820006, 3, 42) at trap+0x334
> > > > > Lslowtrap_reenter(40015a58a00, 77b5db2000, deadbeefdeadc0c7, 1d8, 2df0fc468, 468) at Lslowtrap_reenter+0xf8
> > > > > pmap_page_protect(40010716ab8, c16, 1cc9860, 193dfa0, 1cc9000, 1cc9000) at pmap_page_protect+0x1fc
> > > > > uvm_pagedeactivate(40010716a50, 40015a50d24, 18667a0, 0, 0, 1c8dac0) at uvm_pagedeactivate+0x54
> > > > > uvmpd_scan_active(0, 0, 270f2, 18667a0, 0, ffffffffffffffff) at uvmpd_scan_active+0x150
> > > > > uvm_pageout(400fe6b1e08, 55555556, 18667a0, 1c83f08, 1c83000, 1c8dc18) at uvm_pageout+0x2dc
> > > > > proc_trampoline(0, 0, 0, 0, 0, 0) at proc_trampoline+0x10
> > > > > https://www.openbsd.org/ddb.html describes the minimum info required in bug
> > > > > reports.  Insufficient info makes it difficult to find and fix bugs.
> > > > 
> > > > If there are pmap issues, pmap_page_protect() is certainly the first
> > > > place I'd look.  I'll start looking, but don't expect to have much
> > > > time until after monday.
> > > 
> > > Indeed this crash lies in pmap_page_protect().  llvm-objdump -dlS says
> > > it's stopped at l.2499:
> > > 
> > >         } else {
> > >                 pv_entry_t firstpv;
> > >                 /* remove mappings */
> > > 
> > >                 firstpv = pa_to_pvh(pa);
> > >                 mtx_enter(&pg->mdpage.pvmtx);
> > > 
> > >                 /* First remove the entire list of continuation pv's*/
> > >                 while ((pv = firstpv->pv_next) != NULL) {
> > > -->                     data = pseg_get(pv->pv_pmap, pv->pv_va & PV_VAMASK);
> > > 
> > >                         /* Save REF/MOD info */
> > >                         firstpv->pv_va |= pmap_tte2flags(data);
> > > 
> > > ; /sys/arch/sparc64/sparc64/pmap.c:2499
> > > ;                       data = pseg_get(pv->pv_pmap, pv->pv_va & PV_VAMASK);
> > >     3c10: a7 29 30 0d   sllx %g4, 13, %l3
> > >     3c14: d2 5c 60 10   ldx [%l1+16], %o1
> > >     3c18: d0 5c 60 08   ldx [%l1+8], %o0
> > > --> 3c1c: 40 00 00 00   call 0
> > >     3c20: 92 0a 40 13   and %o1, %l3, %o1
> > > 
> > > As discussed with miod I suspect the crash actually lies inside
> > > pseg_get(), but I can't prove it.
> > 
> > Can you try the diff below.  I've nuked pmap_collect() on other
> > architectures since it is hard to make "mpsafe".  The current sparc64
> > certainly isn't.
> > 
> > If this fixes the crashes, I can see if it is possible to make it
> > "mpsafe" on sparc64.
> 
> I see.  Indeed this function stood out when I looked at the file.  Can
> I really just empty it out like this, or should I also drop
> __HAVE_PMAP_COLLECT?
> 
> Yesterday the machine crashed again in a nearby place in
> pmap_page_protect().
> 
> login: pmap_page_protect: gotten pseg empty!
> Stopped at      pmap_page_protect+0x620:        nop
> ddb{4}> tr
> uvm_pagedeactivate(4000d1510a0, 19d2ce0, 193edd0, 40015a507e4, 1, 1) at uvm_pagedeactivate+0x54
> uvn_flush(401053a3310, 0, 11c8000, 14, 1, 0) at uvn_flush+0x448
> uvn_detach(401053a3310, 40101633630, 1, 0, 0, 1000000) at uvn_detach+0x158
> uvm_unmap_detach(400fe6b5c68, 0, 9a7, 40015a507e4, 40015a507e4, 18f7828) at uvm_unmap_detach+0x68
> uvm_map_teardown(400130912e0, 4000, 1939240, 4010599c000, 54d, 17b2ff8) at uvm_map_teardown+0x184
> uvmspace_free(1cd7000, 400130912e0, 4, 17b2ff8, 0, 6) at uvmspace_free+0x64
> reaper(40015a507e0, 40015a507e0, db, 100, 1c04038, 4000) at reaper+0x100
> proc_trampoline(0, 0, 0, 0, 0, 0) at proc_trampoline+0x10
> ddb{4}>
> 
> This time it's pmap.c l.2534:
> 
>                 /* Then remove the primary pv */
>                 if (pv->pv_pmap != NULL) {
>                         data = pseg_get(pv->pv_pmap, pv->pv_va & PV_VAMASK);
> 
>                         /* Save REF/MOD info */
>                         pv->pv_va |= pmap_tte2flags(data);
>                         if (pseg_set(pv->pv_pmap, pv->pv_va & PV_VAMASK, 0, 0)) {
>                                 printf("pmap_page_protect: gotten pseg empty!\n");
> -->                             db_enter();
>                                 /* panic? */
>                         }
> 
> It's not the first crash with uvmspace_free() going through
> uvm_pagedeactivate(), so I agree with you that a fix is probably
> needed for mpi's uvm_purge() diff (which I haven't tested yet on that
> box).

FWIW pmap_page_protect() itself looks like a potential problem, as you
suspected.  The diff below locks the pmaps in pmap_page_protect(PROT_NONE),
the path used by uvm_pagedeactivate().

Walking the pv list and locking the pmap introduces a lock ordering
constraint, something that the arm64 and riscv64 implementations handle
with a slightly convoluted approach.  Surely sparc64 should have its
pseg_get/set calls protected by the pmap lock too.  Now, sparc64 is
special in several regards: each mapping must be checked for mod/ref
information by accessing psegs, pv list handling is inlined, and the
first element in the pv list is embedded in struct mdpage.

The diff below drops the current pv entry from the list before acting
on it, thus avoiding concurrent access and the "start all over again"
idiom.  For the "first" pv entry, we can make a copy of it so that we
can drop the pv lock.  It could be argued that this static pv
header/entry is making things more complex than they need to be.

The code keeps the mod/ref accounting.  But does it make sense to keep
the mod/ref information at all, if we're flushing/destroying all the
mappings to a page?  If it is indeed pointless, which I'm uncertain
about, we could drop the pseg_get/mtx_enter/etc chunks.  Performance
with the diff below applied is similar to without it (in both cases
with the uvm parallel fault diff applied).

In this very function there's the "lower protection" code path that
walks the pv list and accesses psegs.  Not sure yet how to tackle it,
but maybe we don't need to handle it yet?

Thanks to miod for shedding some light on this pmap implementation
over the weekend (he hasn't seen the diff yet).

Input welcome.


Index: pmap.c
===================================================================
RCS file: /cvs/src/sys/arch/sparc64/sparc64/pmap.c,v
diff -u -p -r1.121 pmap.c
--- pmap.c	26 Jun 2024 01:40:49 -0000	1.121
+++ pmap.c	26 May 2025 08:26:39 -0000
@@ -2489,6 +2489,7 @@ pmap_page_protect(struct vm_page *pg, vm
 		mtx_leave(&pg->mdpage.pvmtx);
 	} else {
 		pv_entry_t firstpv;
+		struct pv_entry pve;
 		/* remove mappings */
 
 		firstpv = pa_to_pvh(pa);
@@ -2496,10 +2497,19 @@ pmap_page_protect(struct vm_page *pg, vm
 
 		/* First remove the entire list of continuation pv's*/
 		while ((pv = firstpv->pv_next) != NULL) {
-			data = pseg_get(pv->pv_pmap, pv->pv_va & PV_VAMASK);
+			/* Remove item from the list, avoids concurrent calls */
+			firstpv->pv_next = pv->pv_next;
+			pv->pv_next = NULL;
+
+			/* Unlock the pv lock to avoid ordering problems */
+			mtx_leave(&pg->mdpage.pvmtx);
+			mtx_enter(&pv->pv_pmap->pm_mtx);
 
 			/* Save REF/MOD info */
+			data = pseg_get(pv->pv_pmap, pv->pv_va & PV_VAMASK);
+			mtx_enter(&pg->mdpage.pvmtx);
 			firstpv->pv_va |= pmap_tte2flags(data);
+			mtx_leave(&pg->mdpage.pvmtx);
 
 			/* Clear mapping */
 			if (pseg_set(pv->pv_pmap, pv->pv_va & PV_VAMASK, 0, 0)) {
@@ -2507,6 +2517,8 @@ pmap_page_protect(struct vm_page *pg, vm
 				db_enter();
 				/* panic? */
 			}
+			mtx_leave(&pv->pv_pmap->pm_mtx);
+
 			if (pv->pv_pmap->pm_ctx || pv->pv_pmap == pmap_kernel()) {
 				tsb_invalidate(pv->pv_pmap->pm_ctx,
 				    (pv->pv_va & PV_VAMASK));
@@ -2515,25 +2527,38 @@ pmap_page_protect(struct vm_page *pg, vm
 			atomic_dec_long(&pv->pv_pmap->pm_stats.resident_count);
 
 			/* free the pv */
-			firstpv->pv_next = pv->pv_next;
-			mtx_leave(&pg->mdpage.pvmtx);
 			pool_put(&pv_pool, pv);
 			mtx_enter(&pg->mdpage.pvmtx);
 		}
 
-		pv = firstpv;
+		/*
+		 * Make a local copy of the primary pv data to guard
+		 * against concurrent calls.
+		 */
+		pv = NULL;
+		if (firstpv->pv_pmap != NULL) {
+			pve = *firstpv;
+			pv = &pve;
+			firstpv->pv_pmap = NULL;
+		}
+		mtx_leave(&pg->mdpage.pvmtx);
+
+		if (pv != NULL) {
+			mtx_enter(&pv->pv_pmap->pm_mtx);
 
-		/* Then remove the primary pv */
-		if (pv->pv_pmap != NULL) {
+			/* Save REF/MOD info */
 			data = pseg_get(pv->pv_pmap, pv->pv_va & PV_VAMASK);
+			mtx_enter(&pg->mdpage.pvmtx);
+			firstpv->pv_va |= pmap_tte2flags(data);
+			mtx_leave(&pg->mdpage.pvmtx);
 
-			/* Save REF/MOD info */
-			pv->pv_va |= pmap_tte2flags(data);
+			/* Clear mapping */
 			if (pseg_set(pv->pv_pmap, pv->pv_va & PV_VAMASK, 0, 0)) {
 				printf("pmap_page_protect: gotten pseg empty!\n");
 				db_enter();
 				/* panic? */
 			}
+			mtx_leave(&pv->pv_pmap->pm_mtx);
 			if (pv->pv_pmap->pm_ctx || pv->pv_pmap == pmap_kernel()) {
 				tsb_invalidate(pv->pv_pmap->pm_ctx,
 				    (pv->pv_va & PV_VAMASK));
@@ -2541,13 +2566,9 @@ pmap_page_protect(struct vm_page *pg, vm
 				    pv->pv_pmap->pm_ctx);
 			}
 			atomic_dec_long(&pv->pv_pmap->pm_stats.resident_count);
-
-			KASSERT(pv->pv_next == NULL);
-			/* dump the first pv */
-			pv->pv_pmap = NULL;
 		}
+
 		dcache_flush_page(pa);
-		mtx_leave(&pg->mdpage.pvmtx);
 	}
 	/* We should really only flush the pages we demapped. */
 }

-- 
jca