From: Mark Kettenis
Subject: Re: sparc64 pmap (was: Re: Please test: parallel fault handling)
To: Jeremie Courreges-Anglas
Cc: tech@openbsd.org, miod@online.fr
Date: Tue, 27 May 2025 15:26:29 +0200

> Date: Mon, 26 May 2025 13:38:39 +0200
> From: Jeremie Courreges-Anglas

Hi Jeremie,

> On Mon, May 26, 2025 at 09:23:02AM +0200, Jeremie Courreges-Anglas wrote:
> > On Mon, May 26, 2025 at 12:04:59AM +0200, Mark Kettenis wrote:
> > >
> > > Can you try the diff below.  I've nuked pmap_collect() on other
> > > architectures since it is hard to make "mpsafe".  The current sparc64
> > > certainly isn't.
> > >
> > > If this fixes the crashes, I can see if it is possible to make it
> > > "mpsafe" on sparc64.
> >
> > I see.  Indeed this function stood out when I looked at the file.  Can
> > I really just empty it out like this, or should I also drop
> > __HAVE_PMAP_COLLECT?
> >
> > Yesterday the machine crashed again in a nearby place in
> > pmap_page_protect().

That is still without my pmap_collect() diff, isn't it?

> > login: pmap_page_protect: gotten pseg empty!
> > Stopped at      pmap_page_protect+0x620:        nop
> > ddb{4}> tr
> > uvm_pagedeactivate(4000d1510a0, 19d2ce0, 193edd0, 40015a507e4, 1, 1) at uvm_pagedeactivate+0x54
> > uvn_flush(401053a3310, 0, 11c8000, 14, 1, 0) at uvn_flush+0x448
> > uvn_detach(401053a3310, 40101633630, 1, 0, 0, 1000000) at uvn_detach+0x158
> > uvm_unmap_detach(400fe6b5c68, 0, 9a7, 40015a507e4, 40015a507e4, 18f7828) at uvm_unmap_detach+0x68
> > uvm_map_teardown(400130912e0, 4000, 1939240, 4010599c000, 54d, 17b2ff8) at uvm_map_teardown+0x184
> > uvmspace_free(1cd7000, 400130912e0, 4, 17b2ff8, 0, 6) at uvmspace_free+0x64
> > reaper(40015a507e0, 40015a507e0, db, 100, 1c04038, 4000) at reaper+0x100
> > proc_trampoline(0, 0, 0, 0, 0, 0) at proc_trampoline+0x10
> > ddb{4}>
> >
> > This time it's pmap.c l.2534:
> >
> > /* Then remove the primary pv */
> > if (pv->pv_pmap != NULL) {
> >         data = pseg_get(pv->pv_pmap, pv->pv_va & PV_VAMASK);
> >
> >         /* Save REF/MOD info */
> >         pv->pv_va |= pmap_tte2flags(data);
> >         if (pseg_set(pv->pv_pmap, pv->pv_va & PV_VAMASK, 0, 0)) {
> >                 printf("pmap_page_protect: gotten pseg empty!\n");
> > -->             db_enter();
> >                 /* panic? */
> >         }
> >
> > It's not the first crash with uvmspace_free() going through
> > uvm_pagedeactivate(), so I agree with you that a fix is probably
> > needed for mpi's uvm_purge() diff (which I haven't tested yet on that
> > box).
>
> FWIW pmap_page_protect() itself looks like a potential problem, as you
> suspected.  The diff below locks the pmaps in pmap_page_protect(PROT_NONE)
> which is used by uvm_pagedeactivate.

I don't rule out issues in pmap_page_protect() yet.  However, this
could still be another manifestation of the issue with pmap_collect().

> Walking the pv list and locking the pmap introduces a lock ordering
> constraint, something that the arm64 and riscv64 implems handle with a
> slightly convoluted approach.  Surely sparc64 should have its
> pseg_get/set calls protected by the pmap lock too.  Now, sparc64 is
> special in several regards: each mapping must be checked for mod/ref
> information by accessing psegs, pv list handling is inlined, and the
> first element in the pv list is embedded in struct mdpage.
>
> The diff below drops the current pv entry from the list before acting
> on it, thus avoiding concurrent access and the "start all over again"
> idiom.  For the "first" pv entry, we can make a copy of it so that we
> can drop the pv lock.  It could be argued that this static pv
> header/entry is making things more complex than they need to be.
>
> The code keeps the mod/ref accounting.  But does it make sense to keep
> the mod/ref information at all, if we're flushing/destroying all the
> mappings to a page?  If it is indeed pointless, which I'm uncertain
> about, we could drop the pseg_get/mtx_enter/etc chunks.  The
> performance of the diff below is similar to before applying it (with
> the uvm parallel fault diff applied).
>
> In this very function there's the "lower protection" code path that
> walks the pv list and access psegs.  Not sure yet how to tackle it,
> but maybe we don't need to handle it yet?
>
> Thanks to miod for shedding some light on this pmap implementation
> over the week-end (he hasn't seen the diff yet).
>
> Input welcome.

The pseg_get() and pseg_set() functions use atomic instructions to
manipulate the page tables, so they shouldn't need the pmap lock.  But
we can't remove pages from the page tables (which is what
pmap_collect() does), because then these functions may read data from
a recycled page (or even modify it).  Note that this isn't just a
problem for pseg_get() and pseg_set(), but also for the "inline
pseg_get()" that is part of the low-level fault handlers.

But there is a potential race with the REF/MOD bits that may need some
attention.  I'll have another look at this diff soon.

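If I read the diff correctly, the handling of each continuation pv
linearizes to roughly the sketch below (simplified, not the actual
code; the pseg_set() error handling, tsb_invalidate()/tlb_flush_pte()
and pool_put() are elided):

        mtx_enter(&pg->mdpage.pvmtx);
        while ((pv = firstpv->pv_next) != NULL) {
                /* Unlink under the pv lock so nobody else can reach pv. */
                firstpv->pv_next = pv->pv_next;
                pv->pv_next = NULL;

                /* Drop the pv lock before taking the pmap lock. */
                mtx_leave(&pg->mdpage.pvmtx);
                mtx_enter(&pv->pv_pmap->pm_mtx);

                data = pseg_get(pv->pv_pmap, pv->pv_va & PV_VAMASK);

                /* Re-take the pv lock only to merge the REF/MOD bits. */
                mtx_enter(&pg->mdpage.pvmtx);
                firstpv->pv_va |= pmap_tte2flags(data);
                mtx_leave(&pg->mdpage.pvmtx);

                /* Clear the mapping, then release the pmap. */
                pseg_set(pv->pv_pmap, pv->pv_va & PV_VAMASK, 0, 0);
                mtx_leave(&pv->pv_pmap->pm_mtx);

                mtx_enter(&pg->mdpage.pvmtx);
        }

So where both locks are held the order is pm_mtx before pvmtx, and the
pv list manipulation itself happens under pvmtx alone.
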
> Index: pmap.c
> ===================================================================
> RCS file: /cvs/src/sys/arch/sparc64/sparc64/pmap.c,v
> diff -u -p -r1.121 pmap.c
> --- pmap.c      26 Jun 2024 01:40:49 -0000      1.121
> +++ pmap.c      26 May 2025 08:26:39 -0000
> @@ -2489,6 +2489,7 @@ pmap_page_protect(struct vm_page *pg, vm
>                  mtx_leave(&pg->mdpage.pvmtx);
>          } else {
>                  pv_entry_t firstpv;
> +                struct pv_entry pve;
>                  /* remove mappings */
>  
>                  firstpv = pa_to_pvh(pa);
> @@ -2496,10 +2497,19 @@ pmap_page_protect(struct vm_page *pg, vm
>  
>                  /* First remove the entire list of continuation pv's*/
>                  while ((pv = firstpv->pv_next) != NULL) {
> -                        data = pseg_get(pv->pv_pmap, pv->pv_va & PV_VAMASK);
> +                        /* Remove item from the list, avoids concurrent calls */
> +                        firstpv->pv_next = pv->pv_next;
> +                        pv->pv_next = NULL;
> +
> +                        /* Unlock the pv lock to avoid ordering problems */
> +                        mtx_leave(&pg->mdpage.pvmtx);
> +                        mtx_enter(&pv->pv_pmap->pm_mtx);
>  
>                          /* Save REF/MOD info */
> +                        data = pseg_get(pv->pv_pmap, pv->pv_va & PV_VAMASK);
> +                        mtx_enter(&pg->mdpage.pvmtx);
>                          firstpv->pv_va |= pmap_tte2flags(data);
> +                        mtx_leave(&pg->mdpage.pvmtx);
>  
>                          /* Clear mapping */
>                          if (pseg_set(pv->pv_pmap, pv->pv_va & PV_VAMASK, 0, 0)) {
> @@ -2507,6 +2517,8 @@ pmap_page_protect(struct vm_page *pg, vm
>                                  db_enter();
>                                  /* panic? */
>                          }
> +                        mtx_leave(&pv->pv_pmap->pm_mtx);
> +
>                          if (pv->pv_pmap->pm_ctx || pv->pv_pmap == pmap_kernel()) {
>                                  tsb_invalidate(pv->pv_pmap->pm_ctx,
>                                      (pv->pv_va & PV_VAMASK));
> @@ -2515,25 +2527,38 @@ pmap_page_protect(struct vm_page *pg, vm
>                          atomic_dec_long(&pv->pv_pmap->pm_stats.resident_count);
>  
>                          /* free the pv */
> -                        firstpv->pv_next = pv->pv_next;
> -                        mtx_leave(&pg->mdpage.pvmtx);
>                          pool_put(&pv_pool, pv);
>                          mtx_enter(&pg->mdpage.pvmtx);
>                  }
>  
> -                pv = firstpv;
> +                /*
> +                 * Make a local copy of the primary pv data to guard
> +                 * against concurrent calls.
> +                 */
> +                pv = NULL;
> +                if (firstpv->pv_pmap != NULL) {
> +                        pve = *firstpv;
> +                        pv = &pve;
> +                        firstpv->pv_pmap = NULL;
> +                }
> +                mtx_leave(&pg->mdpage.pvmtx);
> +
> +                if (pv != NULL) {
> +                        mtx_enter(&pv->pv_pmap->pm_mtx);
>  
> -                /* Then remove the primary pv */
> -                if (pv->pv_pmap != NULL) {
> +                        /* Save REF/MOD info */
>                          data = pseg_get(pv->pv_pmap, pv->pv_va & PV_VAMASK);
> +                        mtx_enter(&pg->mdpage.pvmtx);
> +                        firstpv->pv_va |= pmap_tte2flags(data);
> +                        mtx_leave(&pg->mdpage.pvmtx);
>  
> -                        /* Save REF/MOD info */
> -                        pv->pv_va |= pmap_tte2flags(data);
> +                        /* Clear mapping */
>                          if (pseg_set(pv->pv_pmap, pv->pv_va & PV_VAMASK, 0, 0)) {
>                                  printf("pmap_page_protect: gotten pseg empty!\n");
>                                  db_enter();
>                                  /* panic? */
>                          }
> +                        mtx_leave(&pv->pv_pmap->pm_mtx);
>                          if (pv->pv_pmap->pm_ctx || pv->pv_pmap == pmap_kernel()) {
>                                  tsb_invalidate(pv->pv_pmap->pm_ctx,
>                                      (pv->pv_va & PV_VAMASK));
> @@ -2541,13 +2566,9 @@ pmap_page_protect(struct vm_page *pg, vm
>                                      pv->pv_pmap->pm_ctx);
>                          }
>                          atomic_dec_long(&pv->pv_pmap->pm_stats.resident_count);
> -
> -                        KASSERT(pv->pv_next == NULL);
> -                        /* dump the first pv */
> -                        pv->pv_pmap = NULL;
>                  }
> +
>                  dcache_flush_page(pa);
> -                mtx_leave(&pg->mdpage.pvmtx);
>          }
>          /* We should really only flush the pages we demapped. */
>  }
>
> -- 
> jca
>
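
PS: the REF/MOD window I have in mind looks something like the
interleaving below.  This is purely illustrative, not actual code; "va"
stands for pv->pv_va & PV_VAMASK and the pv lock juggling is omitted.
As far as I can see, while the mapping is still present in the pseg the
low-level fault handlers can update the REF/MOD bits in the TTE without
taking any of the locks involved here, so such an update can land
between the pseg_get() snapshot and the pseg_set() that zeroes the TTE,
and the merged flags then miss it:

        /* CPU A: pmap_page_protect(pg, PROT_NONE), as in the diff */
        mtx_enter(&pv->pv_pmap->pm_mtx);
        data = pseg_get(pv->pv_pmap, va);       /* snapshot, MOD still clear */

        /*
         * CPU B: stores through the still-present mapping; the fault
         * path sets the modified bit in the TTE behind our back.
         */

        firstpv->pv_va |= pmap_tte2flags(data); /* merges the stale snapshot */
        pseg_set(pv->pv_pmap, va, 0, 0);        /* TTE gone, new MOD bit lost */
        mtx_leave(&pv->pv_pmap->pm_mtx);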