
From: Mark Kettenis <mark.kettenis@xs4all.nl>
Subject: Re: amd64: prefer enhanced REP MOVSB/STOSB feature if available
To: Claudio Jeker <cjeker@diehard.n-r-g.com>
Cc: tech@openbsd.org
Date: Mon, 22 Dec 2025 16:04:32 +0100

> Date: Mon, 22 Dec 2025 13:31:49 +0100
> From: Claudio Jeker <cjeker@diehard.n-r-g.com>
> 
> On Mon, Dec 22, 2025 at 01:23:18PM +0100, Martin Pieuchot wrote:
> > As Mateusz Guzik pointed out recently [0] we can greatly reduce the
> > amount of CPU cycles spent zeroing pages by using 'rep stosb'.
> > 
> > Diff below does that, ok?
> > 
> > [0] https://marc.info/?l=openbsd-tech&m=176631121132731&w=2
> 
> Not my area but I think one issue we have since the introduction of
> __HAVE_UVM_PERCPU and struct uvm_pmr_cache is that the system
> no longer uses the pre-zeroed pages provided by the zerothread.
> 
> I think fixing that and giving the system a way to have a per-cpu
> magazine of zeroed pages results in far fewer cycles wasted holding
> critical locks.

Well.  As Mateusz points out, pre-zeroing pages is probably a waste of
time.  At least on modern CPUs.  At least when you're going to use the
memory immediately after you allocate the pages.

The reasoning is simple.  If you allocate a page that has been
pre-zeroed, the data in that page has to come back from main memory
(either because we use non-temporal stores to zero the page or because
it has been zeroed some time ago and the contents of the page are no
longer in the cache).  But if we zero the page when we allocate it,
the data is very likely still in the cache when we touch it.
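
To make that concrete, here is a rough userland sketch (not the kernel
code; the buffer size, 64-byte line stride and names are guesses, adjust
them for your machine) that mimics the two cases: touching pages that
were zeroed long ago and have since fallen out of the cache, versus
zeroing each page right before touching it.  On a machine whose
last-level cache is smaller than the buffer, the second case should come
out ahead even though it pays for the memset.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define PAGESZ	4096
#define NPAGES	16384			/* 64MB, larger than most LLCs */

static long long
nsec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int
main(void)
{
	char *cold = malloc((size_t)NPAGES * PAGESZ);
	char *hot = malloc((size_t)NPAGES * PAGESZ);
	long long t0, t1;
	size_t p, i;

	/* "Pre-zeroed" pages: zeroed well before use... */
	memset(cold, 0, (size_t)NPAGES * PAGESZ);
	/* ...and evicted from the cache by streaming over another buffer. */
	memset(hot, 1, (size_t)NPAGES * PAGESZ);

	t0 = nsec();
	for (p = 0; p < NPAGES; p++)
		for (i = 0; i < PAGESZ; i += 64)
			cold[p * PAGESZ + i] = 1;
	t1 = nsec();
	printf("pre-zeroed, cold: %lld ns\n", t1 - t0);

	/* Zero each page at "allocation" time, then touch it while hot. */
	t0 = nsec();
	for (p = 0; p < NPAGES; p++) {
		memset(hot + p * PAGESZ, 0, PAGESZ);
		for (i = 0; i < PAGESZ; i += 64)
			hot[p * PAGESZ + i] = 1;
	}
	t1 = nsec();
	printf("zero on use, hot: %lld ns\n", t1 - t0);

	/* Read the buffers so the stores above can't be optimized away. */
	printf("%d %d\n", cold[123], hot[456]);
	return 0;
}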

Maybe the zerothread is still useful on older architectures with
smaller caches and slower CPUs.  But I don't think doubling down and
adding a per-cpu magazine of zeroed pages is going to help us.

> > Index: arch/amd64/amd64/locore.S
> > ===================================================================
> > RCS file: /cvs/src/sys/arch/amd64/amd64/locore.S,v
> > diff -u -p -r1.151 locore.S
> > --- arch/amd64/amd64/locore.S	2 Aug 2025 07:33:28 -0000	1.151
> > +++ arch/amd64/amd64/locore.S	22 Dec 2025 11:54:32 -0000
> > @@ -1172,6 +1172,16 @@ ENTRY(pagezero)
> >  	lfence
> >  END(pagezero)
> >  
> > +ENTRY(pagezero_erms)
> > +	RETGUARD_SETUP(pagezero_erms, r11)
> > +	movq    $PAGE_SIZE,%rcx
> > +	xorq    %rax,%rax
> > +	rep stosb
> > +	RETGUARD_CHECK(pagezero_erms, r11)
> > +	ret
> > +	lfence
> > +END(pagezero_erms)
> > +
> >  /* void pku_xonly(void) */
> >  ENTRY(pku_xonly)
> >  	movq	pg_xo,%rax	/* have PKU support? */
> > Index: arch/amd64/amd64/pmap.c
> > ===================================================================
> > RCS file: /cvs/src/sys/arch/amd64/amd64/pmap.c,v
> > diff -u -p -r1.182 pmap.c
> > --- arch/amd64/amd64/pmap.c	15 Aug 2025 13:40:43 -0000	1.182
> > +++ arch/amd64/amd64/pmap.c	22 Dec 2025 11:55:07 -0000
> > @@ -1594,11 +1594,14 @@ pmap_extract(struct pmap *pmap, vaddr_t 
> >  /*
> >   * pmap_zero_page: zero a page
> >   */
> > -
> >  void
> >  pmap_zero_page(struct vm_page *pg)
> >  {
> > -	pagezero(pmap_map_direct(pg));
> > +	/* Prefer enhanced REP MOVSB/STOSB feature if available. */
> > +	if (ISSET(curcpu()->ci_feature_sefflags_ebx, SEFF0EBX_ERMS))
> > +		pagezero_erms(pmap_map_direct(pg));
> > +	else
> > +		pagezero(pmap_map_direct(pg));
> >  }
> >  
> >  /*
> > Index: arch/amd64/include/pmap.h
> > ===================================================================
> > RCS file: /cvs/src/sys/arch/amd64/include/pmap.h,v
> > diff -u -p -r1.94 pmap.h
> > --- arch/amd64/include/pmap.h	7 Jul 2025 00:55:15 -0000	1.94
> > +++ arch/amd64/include/pmap.h	22 Dec 2025 11:46:09 -0000
> > @@ -403,6 +403,7 @@ void		pmap_write_protect(struct pmap *, 
> >  paddr_t	pmap_prealloc_lowmem_ptps(paddr_t);
> >  
> >  void	pagezero(vaddr_t);
> > +void	pagezero_erms(vaddr_t);
> >  
> >  void	pmap_convert(struct pmap *, int);
> >  void	pmap_enter_special(vaddr_t, paddr_t, vm_prot_t);
> > 
> > 
> 
> -- 
> :wq Claudio
> 
>