
From:
Mateusz Guzik <mjguzik@gmail.com>
Subject:
Re: [PATCH] amd64: import optimized memcmp from FreeBSD
To:
Christian Schulte <cs@schulte.it>
Cc:
tech@openbsd.org, deraadt@openbsd.org
Date:
Mon, 2 Dec 2024 07:57:36 +0100

On Mon, Dec 02, 2024 at 02:21:09AM +0000, Christian Schulte wrote:
> On 11/30/24 15:59, Christian Weisgerber wrote:
> > Christian Schulte:
> > 
> >> lately I thought about rewriting those functions to use simd, but never
> >> looked further into doing it, because I do not know if simd
> >> registers/instructions could be used there at all. Could they? Instead
> > 
> > Presumably not in the kernel as we would have to save/restore
> > FPU/SIMD registers around those calls.
> > 
> >> of working on the same thing every now and then, I would be in favour of
> >> providing routines using the most performant instructions the
> >> architecture has to offer (simd), if possible, and be done with it.
> >> Would not mind writing those, if they would be accepted.
> > 
> > FreeBSD already did all that work recently.
> > https://freebsdfoundation.org/blog/a-sneak-peek-simd-enhanced-string-functions-for-amd64/
> > 
> 
> Do you know if they have published some test data somewhere? Something
> which can be used to compare results. I got curious about it (again) and
> instead of directly jumping onto the assembly train, I first wanted to
> find out what the compiler will produce when writing a memcmp function
> in C. Compared to the current memcmp in base a C memcmp compiled with
> -O2 will make the compiler use SIMD instructions. I am only testing it
> when equal so that the whole length has to be scanned.
> 
> I tested various len values (0,1,2,3,4,5,6,7,8...). Either I am a total
> idiot or the C version really outperforms the assembly version. That's
> for a len == 517.
> 
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls  ms/call  ms/call  name
>  72.1     220.33   220.33 1000000000     0.00     0.00  _libc_memcmp [3]
>  17.9     274.98    54.65 1000000000     0.00     0.00  c_memcmp [4]
> 
> And that's for a len == 13.
> 
>  23.6      20.67     9.04 1000000000     0.00     0.00  _libc_memcmp [4]
>  20.9      28.67     8.00 1000000000     0.00     0.00  c_memcmp [5]
> 
> This can't hardly be, can it?
> 

As I already mentioned, the current asm routines are of very poor quality;
one would have to try hard not to do better.

However, I'm deeply confused about what exactly you tested above.

OpenBSD libc for the amd64 platform uses a C variant of the memcmp
routine (as opposed to the kernel, which has the problematic rep variant).
The compiler did not emit SIMD for it in libc.

Compiling the lib/libc/string/memcmp.c file by hand with cc -O2 -c gives
the same assembly as libc. I'm trying this out on OpenBSD 7.6
with clang 16.0.6.

Anyhow, yes, the compiler *can* emit some SIMD here, and it even gained
extensions to support overlapping stores.
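For reference, a plain byte-wise memcmp written in C along these lines (an
illustrative sketch, not the actual lib/libc/string/memcmp.c, and `c_memcmp`
is a made-up name) is exactly the kind of loop the compiler is free to
auto-vectorize at -O2:

```c
#include <stddef.h>

/* Illustrative sketch: a straightforward byte-wise compare loop.
 * With cc -O2 a modern compiler may turn this into SIMD loads and
 * compares on amd64; nothing in the source asks for it explicitly. */
int
c_memcmp(const void *s1, const void *s2, size_t n)
{
	const unsigned char *p1 = s1, *p2 = s2;

	for (size_t i = 0; i < n; i++) {
		if (p1[i] != p2[i])
			return p1[i] - p2[i];
	}
	return 0;
}
```

Whether SIMD actually gets emitted depends on the compiler version and
flags, which is why comparing the generated assembly (cc -O2 -S) is the
only reliable check.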

Personally I benchmarked with this:
https://people.freebsd.org/~mjg/.junk/will-it-scale-memcmp.tgz

$ sh compile-memcmp.sh

Best used on FreeBSD or Linux due to CPU pinning. If you insist on
OpenBSD you will need to replace main.c with:
https://people.freebsd.org/~mjg/.junk/64/main.c

and whack the -lhwloc lines.

Bench with random sizes up to 8 bytes, 1 million samples:
$ sh bench-memcmp.sh sizes_8_1000000

The archive ships with other seeds. More can be generated with gen-sizes.c

My work concerned the kernel variant, where passed sizes skew very
heavily to < 8. I added an upper limit which fits in 1 byte to reduce
cache pressure while iterating the sample set. This could probably be
bumped to a short.
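To illustrate the cache-pressure point (a sketch with made-up names, not
the actual harness code): capping sizes to one byte lets the whole sample
set be stored as a uint8_t array, so it stays compact in cache while the
benchmark loop walks it:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Sketch: iterate a pre-generated set of sizes, each stored in a
 * single byte. A 1-million-entry set then occupies ~1 MB instead of
 * the 8 MB a size_t array would take, reducing cache pressure that
 * would otherwise pollute the measurement of memcmp itself. */
static size_t
consume_sizes(const uint8_t *sizes, size_t nsizes,
    const char *buf1, const char *buf2)
{
	size_t matches = 0;

	for (size_t i = 0; i < nsizes; i++) {
		if (memcmp(buf1, buf2, sizes[i]) == 0)
			matches++;
	}
	return matches;
}
```

Bumping the element type to uint16_t (a short) would allow sizes up to
65535 at twice the footprint, which is the trade-off mentioned above.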

Anyhow.

FreeBSD libc ships with an SSE2 variant of memcmp (written by someone
else). SSE2 being part of the amd64 baseline ISA means it can be used as
is, with no ifunc dispatch needed.
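To sketch the idea (my own hypothetical code, not the FreeBSD routine):
since every amd64 CPU has SSE2, 16-byte chunks can be compared with
intrinsics unconditionally, no runtime feature detection required:

```c
#include <emmintrin.h>	/* SSE2 intrinsics; guaranteed on amd64 */
#include <stddef.h>

/* Illustrative sketch: compare 16 bytes at a time with SSE2, fall
 * back to a byte loop for the tail. Because SSE2 is baseline on
 * amd64, this needs no ifunc or cpuid check. */
int
sse2_memcmp(const void *s1, const void *s2, size_t n)
{
	const unsigned char *p1 = s1, *p2 = s2;

	while (n >= 16) {
		__m128i a = _mm_loadu_si128((const __m128i *)p1);
		__m128i b = _mm_loadu_si128((const __m128i *)p2);
		/* One bit per byte: set where the bytes are equal. */
		int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(a, b));

		if (mask != 0xffff) {
			/* Index of the first differing byte. */
			int idx = __builtin_ctz(~mask & 0xffff);
			return p1[idx] - p2[idx];
		}
		p1 += 16;
		p2 += 16;
		n -= 16;
	}
	while (n-- > 0) {
		if (*p1 != *p2)
			return *p1 - *p2;
		p1++;
		p2++;
	}
	return 0;
}
```

The real FreeBSD routine is in assembly and more heavily tuned; this only
shows why the baseline-ISA argument removes the dispatch problem.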

Shipped in this thread is a SIMD-less variant I wrote specifically for
kernel usage.

What can be done for the kernel for nice speed-ups is porting over the
MEMMOVE macro and its associated uses. See:
https://cgit.freebsd.org/src/tree/sys/amd64/amd64/support.S#n271

This includes memcpy, memmove, copyin, copyout.

Apart from that, there are patched-up memset and copystr/copyinstr routines.

An important note for libc is that memcpy/memmove/bcopy and memset/bzero
remain SIMD-less and thus can be imported with zero hassle; in particular
they are safe to use from the loader itself.

These are also among the most commonly used routines, so I would say
importing them right now is an easy call. Worst case, someone can add
something faster later.

Routines which do use SIMD may or may not require stubs for the loader.