From: Mateusz Guzik <mjguzik@gmail.com>
Subject: Re: amd64: prefer enhanced REP MOVSB/STOSB feature if available
To: Claudio Jeker <cjeker@diehard.n-r-g.com>
Cc: tech@openbsd.org
Date: Tue, 23 Dec 2025 14:53:28 +0100

On Mon, Dec 22, 2025 at 12:31:49PM +0000, Claudio Jeker wrote:
> On Mon, Dec 22, 2025 at 01:23:18PM +0100, Martin Pieuchot wrote:
> > As Mateusz Guzik pointed out recently [0] we can greatly reduce the
> > amount of CPU cycles spent zeroing pages by using 'rep stosb'.
> > 
> > Diff below does that, ok?
> > 
> > [0] https://marc.info/?l=openbsd-tech&m=176631121132731&w=2
> 
> Not my area but I think one issue we have since the introduction of
> __HAVE_UVM_PERCPU and struct uvm_pmr_cache is that the system
> no longer uses the pre-zeroed pages provided by the zerothread.
> 
> I think fixing that and giving the system a way to have a per-cpu
> magazine of zeroed pages results in far fewer cycles wasted holding
> critical locks.

On other systems pagezero or its equivalent is one of the more expensive
routines seen in profiles, so trying to speed it up or avoid it in the
first place makes perfect sense.
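
For reference, a minimal sketch of what ERMS-based zeroing boils down to;
this is illustrative only (the function name is made up here), not the
actual diff under discussion:

#include <stddef.h>

#define PAGE_SIZE 4096

/*
 * Zero one page with "rep stosb".  On CPUs advertising ERMS (Enhanced
 * REP MOVSB/STOSB) the microcode widens the stores internally, so the
 * byte-granular loop is competitive with hand-unrolled SIMD for
 * page-sized buffers.
 */
static inline void
pagezero_erms(void *page)
{
        void *d = page;
        size_t cnt = PAGE_SIZE;

        __asm volatile("rep stosb"
            : "+D" (d), "+c" (cnt)
            : "a" (0)
            : "memory");
}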

One issue with background zeroing is that in practice it is actively
detrimental in a multicore setting. Even ignoring that, if you have a
real workload using the pages, whatever zeroed list you might have will
get shredded immediately.

Page allocation/freeing sees very high traffic when building stuff, so
from a lock contention standpoint you want to stick to per-CPU caches as
much as possible.

Suppose your local cache is empty and you get to fill it all up with
pre-zeroed pages.

Suppose you allocate one page; since it is pre-zeroed, you avoid zeroing
it. Afterwards you free one page and soon after get another request to
allocate.

Do you re-use the most recently freed page (hoping it hangs around in
L3) or do you return one of the already zeroed-out pages? If the former,
you very quickly obsolete the background zeroing thread as you end up
zeroing everything yourself, and if the latter you are posed with
another dilemma: if your per-CPU cache has pages, but none of them are
pre-zeroed, and there are pre-zeroed pages in the global queue, what do
you do? Presumably you don't want to take the global lock here, so you
stick to the cached stuff and zero on demand, but that again diminishes
the background zeroing machinery.
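
To make that dilemma concrete, here is a hypothetical sketch (the names
and layout are made up, this is not uvm_pmr_cache) of what the per-CPU
allocation path has to choose between:

#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096

struct page {
        struct page     *next;
        void            *va;            /* mapped address of the page */
        int              zeroed;
};

struct pcpu_cache {
        struct page     *hot;           /* most recently freed, likely still in L3 */
        struct page     *prezeroed;     /* filled by a background zeroing thread */
};

static struct page *
pcpu_alloc_zeroed(struct pcpu_cache *pc)
{
        struct page *pg;

        if (pc->hot != NULL) {
                /* Cache-warm page, but we pay for the zeroing ourselves,
                 * which obsoletes the background zeroing thread. */
                pg = pc->hot;
                pc->hot = pg->next;
                if (!pg->zeroed)
                        memset(pg->va, 0, PAGE_SIZE);
                return (pg);
        }
        if (pc->prezeroed != NULL) {
                /* Zeroing already paid for, but the page is cold. */
                pg = pc->prezeroed;
                pc->prezeroed = pg->next;
                return (pg);
        }
        /* Both lists empty: fall back to the global queue and its lock. */
        return (NULL);
}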

In other words, the better per-CPU caching works, the less use there is
for background zeroing.

But let's say you instead allow the background zeroing thread to handle
pages stored in per-CPU caches. That would require real locking to
alloc/free pages, which slows things down.

So let's say the idle loop or similar is allowed to zero pages for the
local CPU. Even then there are at least 2 problems:
1. do you use temporal or non-temporal stores? With the former you keep
risking evicting lines from L3 (see the sketch after this list).
2. even ignoring the above, now that CPUs can actually idle, you are
spending cycles instead of being idle. That's perhaps not that great?
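
For point 1, a sketch of the non-temporal variant using SSE2 streaming
stores (illustrative only; the function name is made up and the buffer is
assumed to be page-aligned):

#include <emmintrin.h>
#include <stddef.h>

#define PAGE_SIZE 4096

/*
 * Streaming (non-temporal) stores bypass the cache hierarchy, so idle
 * zeroing does not evict other useful L3 lines, at the cost of the page
 * being cold when an allocation finally touches it.
 */
static void
pagezero_nt(void *page)
{
        __m128i zero = _mm_setzero_si128();
        char *p = page;
        size_t i;

        for (i = 0; i < PAGE_SIZE; i += 16)
                _mm_stream_si128((__m128i *)(p + i), zero);
        _mm_sfence();   /* order the streaming stores before the page is handed out */
}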

All in all, that is a lot of tradeoffs to deal with, and it is not clear
there is any upside.

Possibly something to evaluate after zeroing without the background
machinery gets optimized.

Currently the issues that I see are as follows:
- per-CPU caches are way too small
- fast path allocation and free avoidably serialize on stat counters
  (see the counter sketch after this list)
- page flags get updated several times instead of once, all of it with
  atomic ops (i.e., SLOW)
- slowpath again serializes due to false-sharing
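
A sketch of the stat counter point: each CPU bumps its own cache-line
sized slot with a plain increment and readers sum the slots, so the fast
path never touches a shared line. The names here are made up, this is
not existing OpenBSD code:

#include <stdint.h>

#define MAXCPUS         64
#define CACHE_LINE_SIZE 64

/* one slot per CPU, padded out to a full cache line via alignment */
struct pcpu_counter {
        uint64_t        val;
} __attribute__((aligned(CACHE_LINE_SIZE)));

static struct pcpu_counter pages_freed_pcpu[MAXCPUS];

/* called with the thread pinned to "cpu" (e.g. in a critical section) */
static inline void
counter_inc(int cpu)
{
        pages_freed_pcpu[cpu].val++;    /* plain increment, no atomic op */
}

static uint64_t
counter_read(void)
{
        uint64_t sum = 0;
        int i;

        for (i = 0; i < MAXCPUS; i++)
                sum += pages_freed_pcpu[i].val;
        return (sum);
}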

After all this is sorted out, one idea which should help significantly
is to fill in at least 2 pages (if possible) during a page fault.

That is, suppose you have an anonymous area of 2 pages or more and you
get a fault on the first one. In practice the other one will very likely
also get a fault sooner rather than later, so you can save some work by
handling both in one go.

Of course this will in practice increase RSS for *some* programs, but it
should not be by much. Definitely something to be evaluated.
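
A rough sketch of the idea; the helpers here are placeholders standing in
for the real uvm machinery, and FAULT_BATCH, the function names and the
error handling are all assumptions:

#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE       4096UL
#define FAULT_BATCH     2       /* pages populated per anonymous fault */

/* placeholder helpers, declared but not implemented here */
int      anon_page_resident(uintptr_t va);
int      anon_map_zeroed_page(uintptr_t va);
int      anon_entry_contains(uintptr_t fault_va, uintptr_t va);

static int
anon_fault(uintptr_t fault_va)
{
        uintptr_t base = fault_va & ~(PAGE_SIZE - 1);
        uintptr_t va;
        int i, error;

        for (i = 0; i < FAULT_BATCH; i++) {
                va = base + i * PAGE_SIZE;
                /* never walk past the faulting map entry */
                if (!anon_entry_contains(fault_va, va))
                        break;
                if (anon_page_resident(va))
                        continue;
                error = anon_map_zeroed_page(va);
                if (error) {
                        /* only the page actually faulted on must succeed */
                        if (va == base)
                                return (error);
                        break;
                }
        }
        return (0);
}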

FWIW I experimented with faulting in 16KB instead of 4KB on amd64
$elsewhere and got a real-world speedup from it thanks to cutting down
on total page faults.