From: Mark Kettenis <mark.kettenis@xs4all.nl>
Subject: Re: Please test: parallel fault handling
To: Jeremie Courreges-Anglas <jca@wxcvbn.org>
Cc: tech@openbsd.org
Date: Mon, 26 May 2025 00:04:59 +0200

> Date: Sun, 25 May 2025 23:20:46 +0200
> From: Jeremie Courreges-Anglas <jca@wxcvbn.org>
> 
> On Thu, May 22, 2025 at 08:19:38PM +0200, Mark Kettenis wrote:
> > > Date: Thu, 22 May 2025 18:54:08 +0200
> > > From: Jeremie Courreges-Anglas <jca@wxcvbn.org>
> [...]
> > > *Bzzzt*
> > > 
> > > The same LDOM was busy compiling two devel/llvm copies under dpb(1).
> > > Input welcome, I'm not sure yet what other ddb commands could help.
> > > 
> > > login: panic: trap type 0x34 (mem address not aligned): pc=1012f68 npc=1012f6c pstate=820006<PRIV,IE>
> > > Stopped at      db_enter+0x8:   nop
> > >     TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
> > >   57488   1522      0        0x11          0    1  perl
> > >  435923   9891     55   0x1000002          0    4  cc1plus
> > >  135860  36368     55   0x1000002          0   13  cc1plus
> > >  333743  96489     55   0x1000002          0    0  cc1plus
> > >  433162  55422     55   0x1000002          0    9  cc1plus
> > >  171658  49723     55   0x1000002          0    5  cc1plus
> > >   47127  57536     55   0x1000002          0   10  cc1plus
> > >   56600   9350     55   0x1000002          0   14  cc1plus
> > >  159792  13842     55   0x1000002          0    6  cc1plus
> > >  510019  10312     55   0x1000002          0    8  cc1plus
> > >   20489  65709     55   0x1000002          0   15  cc1plus
> > >  337455  42430     55   0x1000002          0   12  cc1plus
> > >  401407  80906     55   0x1000002          0   11  cc1plus
> > >   22993  62317     55   0x1000002          0    2  cc1plus
> > >  114916  17058     55   0x1000002          0    7  cc1plus
> > > *435412  33034      0     0x14000      0x200    3K pagedaemon
> > > trap(400fe6b19b0, 34, 1012f68, 820006, 3, 42) at trap+0x334
> > > Lslowtrap_reenter(40015a58a00, 77b5db2000, deadbeefdeadc0c7, 1d8, 2df0fc468, 468) at Lslowtrap_reenter+0xf8
> > > pmap_page_protect(40010716ab8, c16, 1cc9860, 193dfa0, 1cc9000, 1cc9000) at pmap_page_protect+0x1fc
> > > uvm_pagedeactivate(40010716a50, 40015a50d24, 18667a0, 0, 0, 1c8dac0) at uvm_pagedeactivate+0x54
> > > uvmpd_scan_active(0, 0, 270f2, 18667a0, 0, ffffffffffffffff) at uvmpd_scan_active+0x150
> > > uvm_pageout(400fe6b1e08, 55555556, 18667a0, 1c83f08, 1c83000, 1c8dc18) at uvm_pageout+0x2dc
> > > proc_trampoline(0, 0, 0, 0, 0, 0) at proc_trampoline+0x10
> > > https://www.openbsd.org/ddb.html describes the minimum info required in bug
> > > reports.  Insufficient info makes it difficult to find and fix bugs.
> > 
> > If there are pmap issues, pmap_page_protect() is certainly the first
> > place I'd look.  I'll start looking, but don't expect to have much
> > time until after monday.
> 
> Indeed this crash lies in pmap_page_protect().  llvm-objdump -dlS says
> it's stopped at l.2499:
> 
>         } else {
>                 pv_entry_t firstpv;
>                 /* remove mappings */
> 
>                 firstpv = pa_to_pvh(pa);
>                 mtx_enter(&pg->mdpage.pvmtx);
> 
>                 /* First remove the entire list of continuation pv's*/
>                 while ((pv = firstpv->pv_next) != NULL) {
> -->                     data = pseg_get(pv->pv_pmap, pv->pv_va & PV_VAMASK);
> 
>                         /* Save REF/MOD info */
>                         firstpv->pv_va |= pmap_tte2flags(data);
> 
> ; /sys/arch/sparc64/sparc64/pmap.c:2499
> ;                       data = pseg_get(pv->pv_pmap, pv->pv_va & PV_VAMASK);
>     3c10: a7 29 30 0d   sllx %g4, 13, %l3
>     3c14: d2 5c 60 10   ldx [%l1+16], %o1
>     3c18: d0 5c 60 08   ldx [%l1+8], %o0
> --> 3c1c: 40 00 00 00   call 0
>     3c20: 92 0a 40 13   and %o1, %l3, %o1
> 
> As discussed with miod I suspect the crash actually lies inside
> pseg_get(), but I can't prove it.

Can you try the diff below?  I've nuked pmap_collect() on other
architectures since it is hard to make "mpsafe".  The current sparc64
implementation certainly isn't.

If this fixes the crashes, I can see if it is possible to make it
"mpsafe" on sparc64.

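For anyone following along, here is a rough user-space sketch of the
scan-then-free pattern that the removed pmap_collect() body uses.  It
is an illustration only, not kernel code: struct leaf, struct dir,
TBL_ENTRIES, ENTRY_VALID, leaf_valid_count() and dir_collect() are all
hypothetical names, and the model has two levels instead of three.
The comment marks the window where, presumably, another CPU walking
the same tables could still be using a page that is about to be
freed; the old code only raises the spl, and takes no lock that a
concurrent walker would also hold.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TBL_ENTRIES 8                    /* hypothetical; stands in for PDSZ/PTSZ */
#define ENTRY_VALID (UINT64_C(1) << 63)  /* hypothetical valid bit, in the role of TLB_V */

struct leaf { uint64_t tte[TBL_ENTRIES]; };      /* models one page table page */
struct dir  { struct leaf *leaf[TBL_ENTRIES]; }; /* models one page directory */

/* Count the entries in a leaf table that have the valid bit set. */
static int
leaf_valid_count(const struct leaf *l)
{
	int n = 0;

	for (int j = 0; j < TBL_ENTRIES; j++)
		if (l->tte[j] & ENTRY_VALID)
			n++;
	return n;
}

/* Free every leaf table with no valid entries; return how many were freed. */
static int
dir_collect(struct dir *d)
{
	int freed = 0;

	assert(d != NULL);
	for (int k = 0; k < TBL_ENTRIES; k++) {
		struct leaf *l = d->leaf[k];

		if (l == NULL)
			continue;
		if (leaf_valid_count(l) == 0) {
			/*
			 * Window: in a kernel, nothing here stops another
			 * CPU from holding a pointer into *l, or from
			 * entering a mapping in it, between the scan above
			 * and the free below.
			 */
			d->leaf[k] = NULL;
			free(l);
			freed++;
		}
	}
	return freed;
}
```

Called on a directory holding one empty leaf and one leaf with a
single valid entry, dir_collect() frees only the empty one.  Making
this safe would need every walker and every inserter to agree on a
lock or some deferred-free scheme, which is why dropping the scan
altogether is the simpler test.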

Index: arch/sparc64/sparc64/pmap.c
===================================================================
RCS file: /cvs/src/sys/arch/sparc64/sparc64/pmap.c,v
diff -u -p -r1.121 pmap.c
--- arch/sparc64/sparc64/pmap.c	26 Jun 2024 01:40:49 -0000	1.121
+++ arch/sparc64/sparc64/pmap.c	25 May 2025 22:02:13 -0000
@@ -1533,43 +1533,6 @@ pmap_release(struct pmap *pm)
 void
 pmap_collect(struct pmap *pm)
 {
-	int i, j, k, n, m, s;
-	paddr_t *pdir, *ptbl;
-	/* This is a good place to scan the pmaps for page tables with
-	 * no valid mappings in them and free them. */
-	
-	/* NEVER GARBAGE COLLECT THE KERNEL PMAP */
-	if (pm == pmap_kernel())
-		return;
-
-	s = splvm();
-	for (i=0; i<STSZ; i++) {
-		if ((pdir = (paddr_t *)(u_long)ldxa((vaddr_t)&pm->pm_segs[i], ASI_PHYS_CACHED))) {
-			m = 0;
-			for (k=0; k<PDSZ; k++) {
-				if ((ptbl = (paddr_t *)(u_long)ldxa((vaddr_t)&pdir[k], ASI_PHYS_CACHED))) {
-					m++;
-					n = 0;
-					for (j=0; j<PTSZ; j++) {
-						int64_t data = ldxa((vaddr_t)&ptbl[j], ASI_PHYS_CACHED);
-						if (data&TLB_V)
-							n++;
-					}
-					if (!n) {
-						/* Free the damn thing */
-						stxa((paddr_t)(u_long)&pdir[k], ASI_PHYS_CACHED, 0);
-						pmap_free_page((paddr_t)ptbl, pm);
-					}
-				}
-			}
-			if (!m) {
-				/* Free the damn thing */
-				stxa((paddr_t)(u_long)&pm->pm_segs[i], ASI_PHYS_CACHED, 0);
-				pmap_free_page((paddr_t)pdir, pm);
-			}
-		}
-	}
-	splx(s);
 }
 
 void