From: Claudio Jeker
Subject: Re: Please test: parallel fault handling
To: Mark Kettenis, tech@openbsd.org
Date: Mon, 18 Aug 2025 14:02:32 +0200

On Thu, Aug 14, 2025 at 01:03:27PM +0200, Martin Pieuchot wrote:
> On 11/06/25(Wed) 13:14, Claudio Jeker wrote:
> > On Wed, Jun 11, 2025 at 12:12:58PM +0200, Mark Kettenis wrote:
> > > > Date: Mon, 9 Jun 2025 14:07:47 +0200
> > > > From: Claudio Jeker
> > > > 
> > > > On Mon, Jun 09, 2025 at 01:46:31PM +0200, Jeremie Courreges-Anglas wrote:
> > > > > On Tue, Jun 03, 2025 at 06:21:17PM +0200, Jeremie Courreges-Anglas wrote:
> > > > > > On Sun, May 25, 2025 at 11:20:46PM +0200, Jeremie Courreges-Anglas wrote:
> > > > > > > On Thu, May 22, 2025 at 08:19:38PM +0200, Mark Kettenis wrote:
> > > > > > > > > Date: Thu, 22 May 2025 18:54:08 +0200
> > > > > > > > > From: Jeremie Courreges-Anglas
> > > > > > > [...]
> > > > > > > > > *Bzzzt*
> > > > > > > > > 
> > > > > > > > > The same LDOM was busy compiling two devel/llvm copies under dpb(1).
> > > > > > > > > Input welcome, I'm not sure yet what other ddb commands could help.
> > > > > > > > > 
> > > > > > > > > login: panic: trap type 0x34 (mem address not aligned): pc=1012f68 npc=1012f6c pstate=820006
> > > > > > > > > Stopped at      db_enter+0x8:   nop
> > > > > > > > >     TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
> > > > > > > > >   57488   1522      0        0x11          0    1  perl
> > > > > > > > >  435923   9891     55   0x1000002          0    4  cc1plus
> > > > > > > > >  135860  36368     55   0x1000002          0   13  cc1plus
> > > > > > > > >  333743  96489     55   0x1000002          0    0  cc1plus
> > > > > > > > >  433162  55422     55   0x1000002          0    9  cc1plus
> > > > > > > > >  171658  49723     55   0x1000002          0    5  cc1plus
> > > > > > > > >   47127  57536     55   0x1000002          0   10  cc1plus
> > > > > > > > >   56600   9350     55   0x1000002          0   14  cc1plus
> > > > > > > > >  159792  13842     55   0x1000002          0    6  cc1plus
> > > > > > > > >  510019  10312     55   0x1000002          0    8  cc1plus
> > > > > > > > >   20489  65709     55   0x1000002          0   15  cc1plus
> > > > > > > > >  337455  42430     55   0x1000002          0   12  cc1plus
> > > > > > > > >  401407  80906     55   0x1000002          0   11  cc1plus
> > > > > > > > >   22993  62317     55   0x1000002          0    2  cc1plus
> > > > > > > > >  114916  17058     55   0x1000002          0    7  cc1plus
> > > > > > > > > *435412  33034      0     0x14000      0x200    3K pagedaemon
> > > > > > > > > trap(400fe6b19b0, 34, 1012f68, 820006, 3, 42) at trap+0x334
> > > > > > > > > Lslowtrap_reenter(40015a58a00, 77b5db2000, deadbeefdeadc0c7, 1d8, 2df0fc468, 468) at Lslowtrap_reenter+0xf8
> > > > > > > > > pmap_page_protect(40010716ab8, c16, 1cc9860, 193dfa0, 1cc9000, 1cc9000) at pmap_page_protect+0x1fc
> > > > > > > > > uvm_pagedeactivate(40010716a50, 40015a50d24, 18667a0, 0, 0, 1c8dac0) at uvm_pagedeactivate+0x54
> > > > > > > > > uvmpd_scan_active(0, 0, 270f2, 18667a0, 0, ffffffffffffffff) at uvmpd_scan_active+0x150
> > > > > > > > > uvm_pageout(400fe6b1e08, 55555556, 18667a0, 1c83f08, 1c83000, 1c8dc18) at uvm_pageout+0x2dc
> > > > > > > > > proc_trampoline(0, 0, 0, 0, 0, 0) at proc_trampoline+0x10
> > > > > > > > > https://www.openbsd.org/ddb.html describes the minimum info required in bug
> > > > > > > > > reports.  Insufficient info makes it difficult to find and fix bugs.
> > > > > > > > 
> > > > > > > > If there are pmap issues, pmap_page_protect() is certainly the first
> > > > > > > > place I'd look.  I'll start looking, but don't expect to have much
> > > > > > > > time until after monday.
> > > > > > > 
> > > > > > > Indeed this crash lies in pmap_page_protect().
> > > > > > > llvm-objdump -dlS says it's stopped at l.2499:
> > > > > > > 
> > > > > > > 	} else {
> > > > > > > 		pv_entry_t firstpv;
> > > > > > > 		/* remove mappings */
> > > > > > > 
> > > > > > > 		firstpv = pa_to_pvh(pa);
> > > > > > > 		mtx_enter(&pg->mdpage.pvmtx);
> > > > > > > 
> > > > > > > 		/* First remove the entire list of continuation pv's*/
> > > > > > > 		while ((pv = firstpv->pv_next) != NULL) {
> > > > > > > -->			data = pseg_get(pv->pv_pmap, pv->pv_va & PV_VAMASK);
> > > > > > > 
> > > > > > > 			/* Save REF/MOD info */
> > > > > > > 			firstpv->pv_va |= pmap_tte2flags(data);
> > > > > > > 
> > > > > > > ; /sys/arch/sparc64/sparc64/pmap.c:2499
> > > > > > > ;		data = pseg_get(pv->pv_pmap, pv->pv_va & PV_VAMASK);
> > > > > > >     3c10: a7 29 30 0d	sllx	%g4, 13, %l3
> > > > > > >     3c14: d2 5c 60 10	ldx	[%l1+16], %o1
> > > > > > >     3c18: d0 5c 60 08	ldx	[%l1+8], %o0
> > > > > > > --> 3c1c: 40 00 00 00	call	0
> > > > > > >     3c20: 92 0a 40 13	and	%o1, %l3, %o1
> > > > > > > 
> > > > > > > As discussed with miod I suspect the crash actually lies inside
> > > > > > > pseg_get(), but I can't prove it.
> > > > > > 
> > > > > > Another similar crash, at the very same offset in pmap_page_protect,
> > > > > > with:
> > > > > > - pmap_collect() removed
> > > > > > - uvm_purge() applied
> > > > > > - uvm parallel fault applied
> > > > > 
> > > > > To try to reproduce this one, I went back to:
> > > > > - pmap_collect() applied
> > > > > - uvm_purge() backed out
> > > > > - uvm parallel fault applied
> > > > > - pmap_page_protect() simplification applied
> > > > > 
> > > > > In the parent mail in this thread I only dumped the first pv entry of
> > > > > the page.  Here we can see that the pmap of the second entry in the pv
> > > > > list appears corrupted.
> > > > > 
> > > > > This is relatively easy to reproduce for me, I just need to build rust
> > > > > and another big port in parallel to reproduce.  rust is a big user of
> > > > > threads.
> > > > 
> > > > I thought we already concluded that pmap_page_protect() is overly
> > > > optimistic and you had a diff to add extra locking to it.
> > > 
> > > While I have some doubts whether the atomic manipulation of the page
> > > tables correctly handles the tracking of the reference and
> > > modification bits, I do believe the locking (using the per-page mutex)
> > > is sufficient to prevent stale pmap references in the pv entries.  And
> > > I would really like to prevent the stupid lock dance that we do on
> > > other architectures.  But I must be missing something.
> > > 
> > > > I think the moment we do parallel uvm faults we run pmap_page_protect()
> > > > concurrent with some other pmap functions and get fireworks.
> > > 
> > > That would most likely be pmap_enter().
> > 
> > I will run with uvm parallel fault handling on my test sparc64 and see if
> > I can also hit the errors jca hit.
> 
> Could I move forward by #ifndef'ing sparc64?  I'd appreciate if somebody
> could debug the pmap issue.  In the meantime I believe we should enable
> this on the other architectures.
> 
> ok?

I'm very much unsure about this. At least from my understanding, enabling
this causes Mac M1 and M2 machines to lock up, and they require dlg's
parking mutex diff to work. There is also the bug report on bugs@ about a
dual-socket amd64 system which hangs using nfdump, which again very much
points at some pmap/IPI issue.

Yes, we can run a while with this enabled to get more feedback, but I'm
not sure we have the time to find all the issues before release,
especially on any arch apart from amd64 and arm64.
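As an aside, the "lock dance" Mark refers to above looks roughly like the
sketch below on other pmaps. This is only an illustration, not the sparc64
code and not something that compiles as-is: the field and helper names
(pm_mtx, mdpage.pv_list, mtx_enter_try()) are approximations, and the real
implementations differ in the details.

	/*
	 * Illustrative sketch only.  Lock order is assumed to be
	 * pmap mutex before per-page pv mutex, so while walking the
	 * pv list under the pv mutex we may only try-lock the pmap
	 * mutex and have to back off and restart on failure.
	 */
	void
	pmap_page_protect_dance(struct vm_page *pg)
	{
		pv_entry_t pv;
		struct pmap *pm;

		mtx_enter(&pg->mdpage.pvmtx);
	restart:
		for (pv = pg->mdpage.pv_list; pv != NULL; pv = pv->pv_next) {
			pm = pv->pv_pmap;
			if (!mtx_enter_try(&pm->pm_mtx)) {
				/*
				 * Drop the pv mutex, take both locks in
				 * the proper order, then start over since
				 * the pv list may have changed underneath
				 * us.  Keeping pm alive across this window
				 * is the hard part -- a stale pv->pv_pmap
				 * here is exactly the kind of corruption
				 * discussed in this thread.
				 */
				mtx_leave(&pg->mdpage.pvmtx);
				mtx_enter(&pm->pm_mtx);
				mtx_enter(&pg->mdpage.pvmtx);
				mtx_leave(&pm->pm_mtx);
				goto restart;
			}
			/* ... clear the mapping and save REF/MOD bits ... */
			mtx_leave(&pm->pm_mtx);
		}
		mtx_leave(&pg->mdpage.pvmtx);
	}

The appeal of the sparc64 approach is that it only ever takes the per-page
mutex, but that is only safe if a pv entry can never point at a pmap that
is concurrently going away, which is what the corrupted second pv entry
above puts in doubt.
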
> Index: uvm/uvm_fault.c
> ===================================================================
> RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
> diff -u -p -r1.170 uvm_fault.c
> --- uvm/uvm_fault.c	14 Jul 2025 08:45:16 -0000	1.170
> +++ uvm/uvm_fault.c	14 Aug 2025 10:57:15 -0000
> @@ -662,7 +662,7 @@ uvm_fault(vm_map_t orig_map, vaddr_t vad
>  	flt.access_type = access_type;
>  	flt.narrow = FALSE;		/* assume normal fault for now */
>  	flt.wired = FALSE;		/* assume non-wired fault for now */
> -#if notyet
> +#ifndef __sparc64__
>  	flt.upper_lock_type = RW_READ;
>  	flt.lower_lock_type = RW_READ;	/* shared lock for now */
>  #else
> 

-- 
:wq Claudio