From: Mark Kettenis <mark.kettenis@xs4all.nl>
Subject: Re: Please test: parallel fault handling
To: Jeremie Courreges-Anglas <jca@wxcvbn.org>
Cc: tech@openbsd.org
Date: Thu, 22 May 2025 20:19:38 +0200

> Date: Thu, 22 May 2025 18:54:08 +0200
> From: Jeremie Courreges-Anglas <jca@wxcvbn.org>
> 
> On Mon, May 19, 2025 at 09:23:27PM +0200, Jeremie Courreges-Anglas wrote:
> > On Tue, May 13, 2025 at 02:28:08PM +0200, Martin Pieuchot wrote:
> > > On 13/05/25(Tue) 13:57, Jeremie Courreges-Anglas wrote:
> > [...]
> > > > The sparc64 LDOM went into two panics - I somehow managed to break it
> > > > out of ddb after the first panic due to bogus conserver.  The data
> > > > below is minimal, sorry, I didn't recover the full trace from dmesg
> > > > after the 1st panic, and 'mach ddbcpu 0' locked up during the 2nd
> > > > panic.
> > > 
> > > I wonder if sparc64's pmap is in need of some love...
> > > 
> > > > 
> > > > I had not run a sparc64 bulk build on this machine in the recent
> > > > months, and I don't know whether both panics are related to your diff,
> > > > but I'll do my best to try and reproduce them.  Currently I've
> > > > restarted the bulk build without the uvm change. Maybe someone with
> > > > more sparc64 knowledge will bring some clue here.
> > > 
> > > Please let me know.
> > 
> > I can already tell that this LDOM ran the rest of the bulk with the
> > parallel uvm fault diff reverted...  Now restarting another build from
> > scratch, with your diff on top of -current.
> > 
> > > You can also start by "ps /o" or "show proc".
> > 
> > Hopefully that will be enough!  mach ddbcpu x didn't seem very usable.
> 
> *Bzzzt*
> 
> The same LDOM was busy compiling two devel/llvm copies under dpb(1).
> Input welcome, I'm not sure yet what other ddb commands could help.
> 
> login: panic: trap type 0x34 (mem address not aligned): pc=1012f68 npc=1012f6c pstate=820006<PRIV,IE>
> Stopped at      db_enter+0x8:   nop
>     TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
>   57488   1522      0        0x11          0    1  perl
>  435923   9891     55   0x1000002          0    4  cc1plus
>  135860  36368     55   0x1000002          0   13  cc1plus
>  333743  96489     55   0x1000002          0    0  cc1plus
>  433162  55422     55   0x1000002          0    9  cc1plus
>  171658  49723     55   0x1000002          0    5  cc1plus
>   47127  57536     55   0x1000002          0   10  cc1plus
>   56600   9350     55   0x1000002          0   14  cc1plus
>  159792  13842     55   0x1000002          0    6  cc1plus
>  510019  10312     55   0x1000002          0    8  cc1plus
>   20489  65709     55   0x1000002          0   15  cc1plus
>  337455  42430     55   0x1000002          0   12  cc1plus
>  401407  80906     55   0x1000002          0   11  cc1plus
>   22993  62317     55   0x1000002          0    2  cc1plus
>  114916  17058     55   0x1000002          0    7  cc1plus
> *435412  33034      0     0x14000      0x200    3K pagedaemon
> trap(400fe6b19b0, 34, 1012f68, 820006, 3, 42) at trap+0x334
> Lslowtrap_reenter(40015a58a00, 77b5db2000, deadbeefdeadc0c7, 1d8, 2df0fc468, 468) at Lslowtrap_reenter+0xf8
> pmap_page_protect(40010716ab8, c16, 1cc9860, 193dfa0, 1cc9000, 1cc9000) at pmap_page_protect+0x1fc
> uvm_pagedeactivate(40010716a50, 40015a50d24, 18667a0, 0, 0, 1c8dac0) at uvm_pagedeactivate+0x54
> uvmpd_scan_active(0, 0, 270f2, 18667a0, 0, ffffffffffffffff) at uvmpd_scan_active+0x150
> uvm_pageout(400fe6b1e08, 55555556, 18667a0, 1c83f08, 1c83000, 1c8dc18) at uvm_pageout+0x2dc
> proc_trampoline(0, 0, 0, 0, 0, 0) at proc_trampoline+0x10
> https://www.openbsd.org/ddb.html describes the minimum info required in bug
> reports.  Insufficient info makes it difficult to find and fix bugs.

If there are pmap issues, pmap_page_protect() is certainly the first
place I'd look.  I'll start looking, but don't expect to have much
time until after Monday.