From: Jeremie Courreges-Anglas
Subject: Re: Please test: parallel fault handling
To: Mark Kettenis
Cc: tech@openbsd.org
Date: Mon, 26 May 2025 09:23:02 +0200

On Mon, May 26, 2025 at 12:04:59AM +0200, Mark Kettenis wrote:
> > Date: Sun, 25 May 2025 23:20:46 +0200
> > From: Jeremie Courreges-Anglas
> >
> > On Thu, May 22, 2025 at 08:19:38PM +0200, Mark Kettenis wrote:
> > > > Date: Thu, 22 May 2025 18:54:08 +0200
> > > > From: Jeremie Courreges-Anglas
> > [...]
> > > > *Bzzzt*
> > > >
> > > > The same LDOM was busy compiling two devel/llvm copies under dpb(1).
> > > > Input welcome, I'm not sure yet what other ddb commands could help.
> > > >
> > > > login: panic: trap type 0x34 (mem address not aligned): pc=1012f68 npc=1012f6c pstate=820006
> > > > Stopped at      db_enter+0x8:   nop
> > > >     TID    PID    UID     PRFLAGS     PFLAGS  CPU  COMMAND
> > > >   57488   1522      0        0x11          0    1  perl
> > > >  435923   9891     55   0x1000002          0    4  cc1plus
> > > >  135860  36368     55   0x1000002          0   13  cc1plus
> > > >  333743  96489     55   0x1000002          0    0  cc1plus
> > > >  433162  55422     55   0x1000002          0    9  cc1plus
> > > >  171658  49723     55   0x1000002          0    5  cc1plus
> > > >   47127  57536     55   0x1000002          0   10  cc1plus
> > > >   56600   9350     55   0x1000002          0   14  cc1plus
> > > >  159792  13842     55   0x1000002          0    6  cc1plus
> > > >  510019  10312     55   0x1000002          0    8  cc1plus
> > > >   20489  65709     55   0x1000002          0   15  cc1plus
> > > >  337455  42430     55   0x1000002          0   12  cc1plus
> > > >  401407  80906     55   0x1000002          0   11  cc1plus
> > > >   22993  62317     55   0x1000002          0    2  cc1plus
> > > >  114916  17058     55   0x1000002          0    7  cc1plus
> > > > *435412  33034      0     0x14000      0x200    3K pagedaemon
> > > > trap(400fe6b19b0, 34, 1012f68, 820006, 3, 42) at trap+0x334
> > > > Lslowtrap_reenter(40015a58a00, 77b5db2000, deadbeefdeadc0c7, 1d8, 2df0fc468, 468) at Lslowtrap_reenter+0xf8
> > > > pmap_page_protect(40010716ab8, c16, 1cc9860, 193dfa0, 1cc9000, 1cc9000) at pmap_page_protect+0x1fc
> > > > uvm_pagedeactivate(40010716a50, 40015a50d24, 18667a0, 0, 0, 1c8dac0) at uvm_pagedeactivate+0x54
> > > > uvmpd_scan_active(0, 0, 270f2, 18667a0, 0, ffffffffffffffff) at uvmpd_scan_active+0x150
> > > > uvm_pageout(400fe6b1e08, 55555556, 18667a0, 1c83f08, 1c83000, 1c8dc18) at uvm_pageout+0x2dc
> > > > proc_trampoline(0, 0, 0, 0, 0, 0) at proc_trampoline+0x10
> > > > https://www.openbsd.org/ddb.html describes the minimum info required in bug
> > > > reports. Insufficient info makes it difficult to find and fix bugs.
> > >
> > > If there are pmap issues, pmap_page_protect() is certainly the first
> > > place I'd look. I'll start looking, but don't expect to have much
> > > time until after monday.
> >
> > Indeed this crash lies in pmap_page_protect(). llvm-objdump -dlS says
> > it's stopped at l.2499:
> >
> >     } else {
> >         pv_entry_t firstpv;
> >         /* remove mappings */
> >
> >         firstpv = pa_to_pvh(pa);
> >         mtx_enter(&pg->mdpage.pvmtx);
> >
> >         /* First remove the entire list of continuation pv's*/
> >         while ((pv = firstpv->pv_next) != NULL) {
> > -->         data = pseg_get(pv->pv_pmap, pv->pv_va & PV_VAMASK);
> >
> >             /* Save REF/MOD info */
> >             firstpv->pv_va |= pmap_tte2flags(data);
> >
> > ; /sys/arch/sparc64/sparc64/pmap.c:2499
> > ; data = pseg_get(pv->pv_pmap, pv->pv_va & PV_VAMASK);
> >     3c10: a7 29 30 0d  sllx %g4, 13, %l3
> >     3c14: d2 5c 60 10  ldx [%l1+16], %o1
> >     3c18: d0 5c 60 08  ldx [%l1+8], %o0
> > --> 3c1c: 40 00 00 00  call 0
> >     3c20: 92 0a 40 13  and %o1, %l3, %o1
> >
> > As discussed with miod I suspect the crash actually lies inside
> > pseg_get(), but I can't prove it.
>
> Can you try the diff below. I've nuked pmap_collect() on other
> architectures since it is hard to make "mpsafe". The current sparc64
> certainly isn't.
>
> If this fixes the crashes, I can see if it is possible to make it
> "mpsafe" on sparc64.

I see. Indeed this function stood out when I looked at the file.
Can I really just empty it out like this, or should I also drop
__HAVE_PMAP_COLLECT?

Yesterday the machine crashed again in a nearby place in
pmap_page_protect().

login: pmap_page_protect: gotten pseg empty!
Stopped at      pmap_page_protect+0x620:        nop
ddb{4}> tr
uvm_pagedeactivate(4000d1510a0, 19d2ce0, 193edd0, 40015a507e4, 1, 1) at uvm_pagedeactivate+0x54
uvn_flush(401053a3310, 0, 11c8000, 14, 1, 0) at uvn_flush+0x448
uvn_detach(401053a3310, 40101633630, 1, 0, 0, 1000000) at uvn_detach+0x158
uvm_unmap_detach(400fe6b5c68, 0, 9a7, 40015a507e4, 40015a507e4, 18f7828) at uvm_unmap_detach+0x68
uvm_map_teardown(400130912e0, 4000, 1939240, 4010599c000, 54d, 17b2ff8) at uvm_map_teardown+0x184
uvmspace_free(1cd7000, 400130912e0, 4, 17b2ff8, 0, 6) at uvmspace_free+0x64
reaper(40015a507e0, 40015a507e0, db, 100, 1c04038, 4000) at reaper+0x100
proc_trampoline(0, 0, 0, 0, 0, 0) at proc_trampoline+0x10
ddb{4}>

This time it's pmap.c l.2534:

        /* Then remove the primary pv */
        if (pv->pv_pmap != NULL) {
            data = pseg_get(pv->pv_pmap, pv->pv_va & PV_VAMASK);

            /* Save REF/MOD info */
            pv->pv_va |= pmap_tte2flags(data);
            if (pseg_set(pv->pv_pmap, pv->pv_va & PV_VAMASK, 0, 0)) {
                printf("pmap_page_protect: gotten pseg empty!\n");
-->             db_enter();
                /* panic? */
            }

It's not the first crash with uvmspace_free() going through
uvm_pagedeactivate(), so I agree with you that a fix is probably
needed for mpi's uvm_purge() diff (which I haven't tested yet on that
box).

-- 
jca