From: Mark Kettenis
Subject: Re: Please test: parallel fault handling
To: Jeremie Courreges-Anglas
Cc: tech@openbsd.org
Date: Thu, 22 May 2025 20:19:38 +0200

> Date: Thu, 22 May 2025 18:54:08 +0200
> From: Jeremie Courreges-Anglas
>
> On Mon, May 19, 2025 at 09:23:27PM +0200, Jeremie Courreges-Anglas wrote:
> > On Tue, May 13, 2025 at 02:28:08PM +0200, Martin Pieuchot wrote:
> > > On 13/05/25(Tue) 13:57, Jeremie Courreges-Anglas wrote:
> > [...]
> > > > The sparc64 LDOM went into two panics - I somehow managed to break it
> > > > out of ddb after the first panic due to bogus conserver. The data
> > > > below is minimal, sorry, I didn't recover the full trace from dmesg
> > > > after the 1st panic, and 'mach ddbcpu 0' locked up during the 2nd
> > > > panic.
> > >
> > > I wonder if sparc64's pmap is in need for some love...
> > >
> > > >
> > > > I had not run a sparc64 bulk build on this machine in the recent
> > > > months, and I don't know whether both panics are related to your diff,
> > > > but I'll do my best to try and reproduce them. Currently I've
> > > > restarted the bulk build without the uvm change. Maybe someone with
> > > > more sparc64 knowledge will bring some clue here.
> > >
> > > Please let me know.
> >
> > I can already tell that this LDOM ran the rest of the bulk with the
> > parallel uvm fault diff reverted... Now restarting another build from
> > scratch, with your diff on top of -current.
> >
> > > You can also start by "ps /o" or "show proc".
> >
> > Hopefully that will be enough! mach ddbcpu x didn't seem very usable.
>
> *Bzzzt*
>
> The same LDOM was busy compiling two devel/llvm copies under dpb(1).
> Input welcome, I'm not sure yet what other ddb commands could help.
>
> login: panic: trap type 0x34 (mem address not aligned): pc=1012f68 npc=1012f6c pstate=820006
> Stopped at      db_enter+0x8:   nop
>     TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
>   57488   1522      0        0x11          0    1  perl
>  435923   9891     55   0x1000002          0    4  cc1plus
>  135860  36368     55   0x1000002          0   13  cc1plus
>  333743  96489     55   0x1000002          0    0  cc1plus
>  433162  55422     55   0x1000002          0    9  cc1plus
>  171658  49723     55   0x1000002          0    5  cc1plus
>   47127  57536     55   0x1000002          0   10  cc1plus
>   56600   9350     55   0x1000002          0   14  cc1plus
>  159792  13842     55   0x1000002          0    6  cc1plus
>  510019  10312     55   0x1000002          0    8  cc1plus
>   20489  65709     55   0x1000002          0   15  cc1plus
>  337455  42430     55   0x1000002          0   12  cc1plus
>  401407  80906     55   0x1000002          0   11  cc1plus
>   22993  62317     55   0x1000002          0    2  cc1plus
>  114916  17058     55   0x1000002          0    7  cc1plus
> *435412  33034      0     0x14000      0x200    3K pagedaemon
> trap(400fe6b19b0, 34, 1012f68, 820006, 3, 42) at trap+0x334
> Lslowtrap_reenter(40015a58a00, 77b5db2000, deadbeefdeadc0c7, 1d8, 2df0fc468, 468) at Lslowtrap_reenter+0xf8
> pmap_page_protect(40010716ab8, c16, 1cc9860, 193dfa0, 1cc9000, 1cc9000) at pmap_page_protect+0x1fc
> uvm_pagedeactivate(40010716a50, 40015a50d24, 18667a0, 0, 0, 1c8dac0) at uvm_pagedeactivate+0x54
> uvmpd_scan_active(0, 0, 270f2, 18667a0, 0, ffffffffffffffff) at uvmpd_scan_active+0x150
> uvm_pageout(400fe6b1e08, 55555556, 18667a0, 1c83f08, 1c83000, 1c8dc18) at uvm_pageout+0x2dc
> proc_trampoline(0, 0, 0, 0, 0, 0) at proc_trampoline+0x10
> https://www.openbsd.org/ddb.html describes the minimum info required in bug
> reports. Insufficient info makes it difficult to find and fix bugs.

If there are pmap issues, pmap_page_protect() is certainly the first
place I'd look. I'll start looking, but don't expect to have much
time until after monday.