amd64: prefer enhanced REP MOVSB/STOSB feature if available
On Mon, Dec 22, 2025 at 12:31:49PM +0000, Claudio Jeker wrote:
> On Mon, Dec 22, 2025 at 01:23:18PM +0100, Martin Pieuchot wrote:
> > As Mateusz Guzik pointed out recently [0] we can greatly reduce the
> > amount of CPU cycles spent zeroing pages by using 'rep stosb'.
> >
> > Diff below does that, ok?
> >
> > [0] https://marc.info/?l=openbsd-tech&m=176631121132731&w=2
>
> Not my area but I think one issue we have since the introduction of
> __HAVE_UVM_PERCPU and struct uvm_pmr_cache is that the system no
> longer uses the pre-zeroed pages provided by the zerothread.
>
> I think fixing that and giving the system a way to have a per-cpu
> magazine of zeroed pages results in far less cycles wasted holding
> critical locks.

On other systems pagezero or its equivalent is one of the more expensive
routines seen in profiles, so trying to speed it up or avoid it in the
first place makes perfect sense.

One issue with background zeroing is that in practice it is actively
detrimental in a multicore setting. Even ignoring that, if you have a
real workload using the pages, whatever zeroed list you might have will
get shredded immediately.

Page allocation/freeing sees very high traffic when building stuff, so
from a lock contention standpoint you want to stick to per-CPU caches as
much as possible. Suppose your local cache is empty and you get to fill
it all up with pre-zeroed pages. You allocate one page and manage to
avoid zeroing it. Afterwards you free one page and soon after get
another request to allocate. Do you re-use the most recently freed page
(hoping it still hangs around in L3) or do you return one of the already
zeroed-out pages? If the former, you very quickly obsolete the
background zeroing thread as you end up zeroing everything yourself; if
the latter, you are posed with another dilemma: if your per-CPU cache
has pages, but none of them are pre-zeroed, and there are pre-zeroed
pages in the global queue, what do you do? Presumably you don't want to
take the global lock here, so you stick to the cached pages and zero on
demand, but that again diminishes the background zeroing machinery. iow
the better per-CPU caching works, the less use there is for background
zeroing.

But let's say you instead allow the background zeroing thread to handle
pages stored in per-CPU caches. That would require real locking to
alloc/free pages, which slows things down. So let's say the idle loop or
similar is allowed to zero pages for the local CPU. Even then there are
at least 2 problems:

1. do you use temporal or non-temporal stores? You keep risking evicting
   lines from L3.
2. even ignoring the above, now that CPUs can actually idle, you are
   spending cycles instead of being idle. That's perhaps not that great?

All in all, that's a lot of tradeoffs to deal with where it is not clear
if there is any upside. Possibly something to evaluate after zeroing
without it gets optimized.

Currently the issues that I see are as follows:
- per-CPU caches are way too small
- fast path allocation and free avoidably serializes on stat counters
  (a sketch of the per-CPU counter idea follows below)
- page flags get updated several times instead of once, all of it with
  atomic ops (i.e., SLOW)
- slowpath again serializes due to false sharing
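To make the stat counter point concrete, here is a minimal sketch of
per-CPU counters. It is not taken from the UVM code; the names
(pcpu_counter, NCPU, CACHELINE) are made up for the example. The fast
path does a plain store into a CPU-local, cache-line-padded slot, so
allocation/free on different CPUs never bounce a shared counter line;
only a reader walks all slots to compute the total.

/*
 * Hypothetical sketch, not the UVM code: per-CPU statistics counters.
 * The fast path increments a CPU-local slot with a plain store, so
 * allocation/free on different CPUs never serialize on a shared
 * cache line; only readers pay the cost of summing all slots.
 */
#include <stdint.h>

#define NCPU		64
#define CACHELINE	64

struct pcpu_counter {
	/* pad each slot to a full cache line to avoid false sharing */
	struct {
		uint64_t val;
		char pad[CACHELINE - sizeof(uint64_t)];
	} slot[NCPU];
};

/* fast path: plain store to the local slot, no atomics, no locks */
static inline void
pcpu_counter_add(struct pcpu_counter *c, int mycpu, uint64_t n)
{
	c->slot[mycpu].val += n;
}

/* slow path: walk all CPUs only when the statistic is actually read */
static inline uint64_t
pcpu_counter_read(const struct pcpu_counter *c)
{
	uint64_t sum = 0;
	int i;

	for (i = 0; i < NCPU; i++)
		sum += c->slot[i].val;
	return sum;
}

The same batching idea applies to the page flag updates: compute the
final flag word once and publish it with a single store instead of
several atomic read-modify-write operations.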
After all this is sorted out, one idea which should help significantly
is to fill in at least 2 pages (if possible) during a page fault. As in,
suppose you have an anonymous area of 2 pages or more and you get a
fault on the first one. In practice the other one will very likely also
get a fault sooner rather than later, so you can save on some work by
handling both in one go. Of course this will in practice increase RSS
for *some* programs, but it should not be by much. Definitely something
to be evaluated. fwiw I experimented with faulting 16KB instead of 4KB
on amd64 $elsewhere and got a real-world speedup from it thanks to
cutting down on total page faults.
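To illustrate the fault-ahead idea, a sketch only: the types and
helpers below (vm_region, page_is_resident, map_zeroed_page) are
stand-ins invented for the example, not the uvm_fault() interfaces. On
a fault in an anonymous mapping it fills the faulting page and, if the
next page still lies within the region and is not resident, fills that
one too, saving the second fault.

/*
 * Hypothetical sketch of faulting in two pages at once.  The types and
 * helpers below are stand-ins for the example, not the real UVM ones.
 */
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE	4096UL

typedef uintptr_t vaddr_t;

struct vm_region {		/* stand-in for an anonymous mapping */
	vaddr_t start;
	vaddr_t end;
};

/* stub for the example: would ask the pmap whether 'va' is mapped */
static bool
page_is_resident(struct vm_region *r, vaddr_t va)
{
	(void)r; (void)va;
	return false;
}

/* stub for the example: would allocate and map a zeroed page at 'va' */
static void
map_zeroed_page(struct vm_region *r, vaddr_t va)
{
	(void)r; (void)va;
}

/*
 * Handle a fault at 'faultva': always fill the faulting page and
 * opportunistically fill the following one, since it will very likely
 * be touched soon anyway.
 */
static void
anon_fault(struct vm_region *r, vaddr_t faultva)
{
	vaddr_t va = faultva & ~(PAGE_SIZE - 1);	/* page-align */
	vaddr_t next = va + PAGE_SIZE;

	map_zeroed_page(r, va);

	/* fault-ahead: also fill the neighbour if it is ours and absent */
	if (next < r->end && !page_is_resident(r, next))
		map_zeroed_page(r, next);
}

Whether the extra fill pays off is exactly the RSS-versus-fault-count
tradeoff mentioned above, so it would need to be measured.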