amd64: prefer enhanced REP MOVSB/STOSB feature if available
On Mon, Dec 22, 2025 at 12:23:18PM +0000, Martin Pieuchot wrote:
> As Mateusz Guzik pointed out recently [0] we can greatly reduce the
> amount of CPU cycles spent zeroing pages by using 'rep stosb'.
>
I said rep stosq, but it definitely holds for stosb as well. :)
> Index: arch/amd64/amd64/locore.S
> ===================================================================
> RCS file: /cvs/src/sys/arch/amd64/amd64/locore.S,v
> diff -u -p -r1.151 locore.S
> --- arch/amd64/amd64/locore.S 2 Aug 2025 07:33:28 -0000 1.151
> +++ arch/amd64/amd64/locore.S 22 Dec 2025 11:54:32 -0000
> @@ -1172,6 +1172,16 @@ ENTRY(pagezero)
> lfence
> END(pagezero)
>
> +ENTRY(pagezero_erms)
> + RETGUARD_SETUP(pagezero_erms, r11)
> + movq $PAGE_SIZE,%rcx
> + xorq %rax,%rax
This can be shortened (same effect, smaller encoding):
xorl %eax, %eax
> + rep stosb
For reasons unknown to me, BSD codebases keep the rep prefix on a
separate line, so:
rep
stosb
> + RETGUARD_CHECK(pagezero_erms, r11)
> + ret
> + lfence
> +END(pagezero_erms)
> +
> /* void pku_xonly(void) */
> ENTRY(pku_xonly)
> movq pg_xo,%rax /* have PKU support? */
> Index: arch/amd64/amd64/pmap.c
> ===================================================================
> RCS file: /cvs/src/sys/arch/amd64/amd64/pmap.c,v
> diff -u -p -r1.182 pmap.c
> --- arch/amd64/amd64/pmap.c 15 Aug 2025 13:40:43 -0000 1.182
> +++ arch/amd64/amd64/pmap.c 22 Dec 2025 11:55:07 -0000
> @@ -1594,11 +1594,14 @@ pmap_extract(struct pmap *pmap, vaddr_t
> /*
> * pmap_zero_page: zero a page
> */
> -
> void
> pmap_zero_page(struct vm_page *pg)
> {
> - pagezero(pmap_map_direct(pg));
> + /* Prefer enhanced REP MOVSB/STOSB feature if available. */
> + if (ISSET(curcpu()->ci_feature_sefflags_ebx, SEFF0EBX_ERMS))
> + pagezero_erms(pmap_map_direct(pg));
> + else
> + pagezero(pmap_map_direct(pg));
> }
The general point about avoiding non-temporal stores holds for all CPUs
regardless of ERMS; the zeroing at hand can be done with rep stosq.
While stosb is recommended when ERMS is present, not using it can be
tolerated for the time being.
The variant I tested with is at the end.
I did not submit the diff as is because of the current expectation that
pagezero uses non-temporal stores. I think the safest way forward for
now is to add pmap_zero_page_cached or similar, plus a matching
pagezero_cached, and sprinkle their use as needed.
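To illustrate the idea, here is a hedged userspace sketch of what a
cache-allocating zeroing routine could look like. The name
pagezero_cached is a placeholder matching the suggestion above; the
inline asm is a C rendering of rep stosb, not the kernel locore code:

```c
#include <stddef.h>
#include <string.h>

/*
 * Sketch of a cache-allocating page zero: rep stosb on amd64,
 * plain memset elsewhere.  A rep stosq loop would do equally well
 * without an ERMS check; the point is simply to stay in the cache
 * rather than use movnti's non-temporal stores.
 */
static void
pagezero_cached(void *va, size_t len)
{
#if defined(__x86_64__)
	void *dst = va;
	size_t cnt = len;

	/* RDI = dst, RCX = count, AL = 0; stores AL, RCX times. */
	__asm volatile("rep stosb"
	    : "+D" (dst), "+c" (cnt)
	    : "a" (0)
	    : "memory");
#else
	memset(va, 0, len);
#endif
}
```

Callers that know the page will be touched again soon would use this
variant; the existing non-temporal pagezero stays for cold pages.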
diff --git a/sys/arch/amd64/amd64/locore.S b/sys/arch/amd64/amd64/locore.S
index 2c19fbf0a30..c5d7c6f48d6 100644
--- a/sys/arch/amd64/amd64/locore.S
+++ b/sys/arch/amd64/amd64/locore.S
@@ -1156,17 +1156,10 @@ CODEPATCH_CODE_LEN(_jmpr13, jmp *%r13; int3)
ENTRY(pagezero)
RETGUARD_SETUP(pagezero, r11)
- movq $-PAGE_SIZE,%rdx
- subq %rdx,%rdi
- xorq %rax,%rax
-1:
- movnti %rax,(%rdi,%rdx)
- movnti %rax,8(%rdi,%rdx)
- movnti %rax,16(%rdi,%rdx)
- movnti %rax,24(%rdi,%rdx)
- addq $32,%rdx
- jne 1b
- sfence
+ movl $PAGE_SIZE/8,%ecx
+ xorl %eax,%eax
+ rep
+ stosq
RETGUARD_CHECK(pagezero, r11)
ret
lfence