From: Mateusz Guzik
Subject: Re: [PATCH] amd64: import optimized memcmp from FreeBSD
To: Christian Schulte
Cc: tech@openbsd.org, deraadt@openbsd.org
Date: Mon, 2 Dec 2024 08:03:30 +0100

On Mon, Dec 2, 2024 at 7:57 AM Mateusz Guzik wrote:
>
> On Mon, Dec 02, 2024 at 02:21:09AM +0000, Christian Schulte wrote:
> > On 11/30/24 15:59, Christian Weisgerber wrote:
> > > Christian Schulte:
> > >
> > >> lately I thought about rewriting those functions to use SIMD, but never
> > >> looked further into doing it, because I do not know if SIMD
> > >> registers/instructions could be used there at all. Could they? Instead
> > >
> > > Presumably not in the kernel, as we would have to save/restore
> > > FPU/SIMD registers around those calls.
> > >
> > >> of working on the same thing every now and then, I would be in favour of
> > >> providing routines using the most performant instructions the
> > >> architecture has to offer (SIMD), if possible, and be done with it.
> > >> Would not mind writing those, if they would be accepted.
> > >
> > > FreeBSD already did all that work recently.
> > > https://freebsdfoundation.org/blog/a-sneak-peek-simd-enhanced-string-functions-for-amd64/
> > >
> >
> > Do you know if they have published some test data somewhere? Something
> > that can be used to compare results. I got curious about it (again) and,
> > instead of directly jumping onto the assembly train, I first wanted to
> > find out what the compiler would produce for a memcmp function written
> > in C. Compared to the current memcmp in base, a C memcmp compiled with
> > -O2 makes the compiler emit SIMD instructions. I am only testing the
> > case where the buffers are equal, so that the whole length has to be
> > scanned.
> >
> > I tested various len values (0, 1, 2, 3, 4, 5, 6, 7, 8, ...). Either I
> > am a total idiot or the C version really outperforms the assembly
> > version. That's for len == 517.
> >
> >   %  cumulative    self                self    total
> > time    seconds  seconds      calls ms/call  ms/call  name
> > 72.1     220.33   220.33 1000000000    0.00     0.00  _libc_memcmp [3]
> > 17.9     274.98    54.65 1000000000    0.00     0.00  c_memcmp [4]
> >
> > And that's for len == 13.
> >
> > 23.6      20.67     9.04 1000000000    0.00     0.00  _libc_memcmp [4]
> > 20.9      28.67     8.00 1000000000    0.00     0.00  c_memcmp [5]
> >
> > This can hardly be, can it?
> >
>
> As I already mentioned, the current asm routines are of very poor quality
> and one would have to try hard not to do better.
>
> However, I'm deeply confused about what exactly you tested above.
>
> OpenBSD libc for the amd64 platform uses a C variant of the memcmp
> routine (as opposed to the kernel, which has the problematic rep variant).
> The compiler did not emit SIMD for it in libc.
>
> Compiling the lib/libc/string/memcmp.c file by hand with cc -O2 -c gives
> the same assembly as libc. I'm trying this out on OpenBSD 7.6 with
> clang 16.0.6.
>
> Anyhow, yes, the compiler *can* emit some SIMD here, and it even got
> extensions to support overlapping stores.
>
> Personally I benchmarked with this:
> https://people.freebsd.org/~mjg/.junk/will-it-scale-memcmp.tgz
>
> $ sh compile-memcmp.sh
>
> Best used on FreeBSD or Linux due to CPU pinning. If you insist on
> OpenBSD you will need to replace main.c with:
> https://people.freebsd.org/~mjg/.junk/64/main.c
>
> and whack the -lhwloc lines.
>
> Bench with random sizes up to 8 bytes, 1 million samples:
> $ sh bench-memcmp.sh sizes_8_1000000
>
> The archive ships with other seeds. More can be generated with gen-sizes.c.
>
> My work concerned the kernel variant, where the passed sizes skew very
> heavily to < 8, and I added an upper limit which fits in 1 byte to
> reduce cache pressure while iterating the set. This can probably be
> bumped to a short.
>
> Anyhow.
>
> FreeBSD libc ships with an SSE2 variant of memcmp (written by someone
> else). SSE2 being part of the AMD64 ISA means it can be used as is, no
> ifunc.
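[To illustrate the above: this is not the FreeBSD implementation, just a
minimal sketch of the general SSE2 technique. The function name is
hypothetical. Since SSE2 is baseline amd64, such a routine needs no CPU
feature detection.]

```c
/*
 * Sketch of an SSE2 memcmp: compare 16 bytes per iteration, fall back
 * to a scalar loop for the tail. Real implementations handle the tail
 * with overlapping loads instead of a byte loop.
 */
#include <emmintrin.h>
#include <stddef.h>

static int
memcmp_sse2_sketch(const void *s1, const void *s2, size_t n)
{
	const unsigned char *p1 = s1, *p2 = s2;

	while (n >= 16) {
		__m128i a = _mm_loadu_si128((const __m128i *)p1);
		__m128i b = _mm_loadu_si128((const __m128i *)p2);
		/* mask is 0xffff iff all 16 bytes are equal. */
		unsigned int mask =
		    _mm_movemask_epi8(_mm_cmpeq_epi8(a, b));
		if (mask != 0xffff) {
			/* Index of the first differing byte. */
			int i = __builtin_ctz(~mask);
			return (p1[i] - p2[i]);
		}
		p1 += 16;
		p2 += 16;
		n -= 16;
	}
	/* Scalar tail. */
	for (; n != 0; n--, p1++, p2++) {
		if (*p1 != *p2)
			return (*p1 - *p2);
	}
	return (0);
}
```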
>
> Shipped in this thread is a SIMD-less variant written by me specifically
> for kernel usage.
>
> What can be done for the kernel for nice speed-ups is porting over the
> MEMMOVE macro and its associated uses. See:
> https://cgit.freebsd.org/src/tree/sys/amd64/amd64/support.S#n271
>
> This includes memcpy, memmove, copyin, copyout.
>
> Apart from that, there are patched-up memset and copystr/copyinstr.
>
> An important note for libc is that memcpy/memmove/bcopy and memset/bzero
> remain SIMD-less and thus can be imported with zero hassle; in particular
> they are safe to use from the loader itself.
>
> These are among the most commonly used routines as well, so I would say
> importing them right now is an easy call. Worst case, someone can add
> something faster later.
>
> Routines which do use SIMD may or may not require stubs for the loader.

Of course I forgot to add some results.

Trusty Haswell:

$ sh bench-memcmp.sh sizes_4_1000000
memcmp-openbsd-kernel   average:26762992
memcmp-netbsd-kernel    average:93858975
memcmp-freebsd-kernel   average:106406634
memcmp-freebsd-libc     average:140033561

$ sh bench-memcmp.sh sizes_8_1000000
memcmp-openbsd-kernel   average:25240628
memcmp-netbsd-kernel    average:74746023
memcmp-freebsd-kernel   average:97592034
memcmp-freebsd-libc     average:151985334

Not so trusty Sapphire Rapids:

$ sh bench-memcmp.sh sizes_8_1000000
memcmp-openbsd-kernel   average:39259794
memcmp-netbsd-kernel    average:88625505
memcmp-freebsd-kernel   average:120837995
memcmp-freebsd-libc     average:300220276

$ sh bench-memcmp.sh sizes_255_1000000
memcmp-openbsd-kernel   average:29169445
memcmp-netbsd-kernel    average:42962946
memcmp-freebsd-kernel   average:57501612
memcmp-freebsd-libc     average:84460824

--
Mateusz Guzik
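[Reader's note: for anyone wanting to reproduce the C-vs-assembly
experiment from earlier in the thread, the C side is essentially a plain
byte loop like the sketch below. The name c_memcmp mirrors the gprof
output above; whether the compiler vectorizes it depends on the compiler
version and flags, so treat the SIMD claim as something to verify with
-O2 and an inspection of the generated assembly.]

```c
/*
 * A straightforward C memcmp: compare byte by byte, return the
 * difference of the first mismatching pair as unsigned chars
 * (standard memcmp semantics), 0 if the buffers are equal.
 */
#include <stddef.h>

int
c_memcmp(const void *s1, const void *s2, size_t n)
{
	const unsigned char *p1 = s1, *p2 = s2;

	for (; n != 0; n--, p1++, p2++) {
		if (*p1 != *p2)
			return (*p1 - *p2);
	}
	return (0);
}
```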