From: Mateusz Guzik
Subject: Re: [PATCH] amd64: import optimized memcmp from FreeBSD
To: Christian Schulte
Cc: tech@openbsd.org, deraadt@openbsd.org
Date: Mon, 2 Dec 2024 08:03:30 +0100

On Mon, Dec 2, 2024 at 7:57 AM Mateusz Guzik wrote:
>
> On Mon, Dec 02, 2024 at 02:21:09AM +0000, Christian Schulte wrote:
> > On 11/30/24 15:59, Christian Weisgerber wrote:
> > > Christian Schulte:
> > >
> > >> lately I thought about rewriting those functions to use SIMD, but never
> > >> looked further into doing it, because I do not know if SIMD
> > >> registers/instructions could be used there at all. Could they? Instead
> > >
> > > Presumably not in the kernel, as we would have to save/restore
> > > FPU/SIMD registers around those calls.
> > >
> > >> of working on the same thing every now and then, I would be in favour of
> > >> providing routines using the most performant instructions the
> > >> architecture has to offer (SIMD), if possible, and be done with it.
> > >> Would not mind writing those, if they would be accepted.
> > >
> > > FreeBSD already did all that work recently.
> > > https://freebsdfoundation.org/blog/a-sneak-peek-simd-enhanced-string-functions-for-amd64/
> > >
> >
> > Do you know if they have published some test data somewhere? Something
> > that can be used to compare results. I got curious about it (again) and,
> > instead of directly jumping onto the assembly train, I first wanted to
> > find out what the compiler would produce for a memcmp function written
> > in C. Compared to the current memcmp in base, a C memcmp compiled with
> > -O2 makes the compiler emit SIMD instructions. I am only testing the
> > case where the buffers are equal, so that the whole length has to be
> > scanned.
> >
> > I tested various len values (0, 1, 2, 3, 4, 5, 6, 7, 8, ...). Either I
> > am a total idiot or the C version really outperforms the assembly
> > version. That's for len == 517.
> >
> >   %  cumulative    self                self    total
> > time    seconds  seconds      calls ms/call  ms/call  name
> > 72.1     220.33   220.33 1000000000    0.00     0.00  _libc_memcmp [3]
> > 17.9     274.98    54.65 1000000000    0.00     0.00  c_memcmp [4]
> >
> > And that's for len == 13.
> >
> > 23.6      20.67     9.04 1000000000    0.00     0.00  _libc_memcmp [4]
> > 20.9      28.67     8.00 1000000000    0.00     0.00  c_memcmp [5]
> >
> > This can hardly be, can it?
> >
>
> As I already mentioned, the current asm routines are of very poor quality
> and one would have to try hard not to do better.
>
> However, I'm deeply confused about what exactly you tested above.
>
> OpenBSD libc for the amd64 platform uses a C variant of the memcmp
> routine (as opposed to the kernel, which has the problematic rep variant).
> The compiler did not emit SIMD for it in libc.
>
> Compiling the lib/libc/string/memcmp.c file by hand with cc -O2 -c gives
> the same assembly as libc. I'm trying this out on OpenBSD 7.6 with
> clang 16.0.6.
>
> Anyhow, yes, the compiler *can* emit some SIMD here, and it even got
> extensions to support overlapping stores.
>
> Personally I benchmarked with this:
> https://people.freebsd.org/~mjg/.junk/will-it-scale-memcmp.tgz
>
> $ sh compile-memcmp.sh
>
> Best used on FreeBSD or Linux due to CPU pinning. If you insist on
> OpenBSD you will need to replace main.c with:
> https://people.freebsd.org/~mjg/.junk/64/main.c
>
> and whack the -lhwloc lines.
>
> Bench with random sizes up to 8 bytes, 1 million samples:
> $ sh bench-memcmp.sh sizes_8_1000000
>
> The archive ships with other seeds. More can be generated with gen-sizes.c.
>
> My work concerned the kernel variant, where the passed sizes skew very
> heavily to < 8, and I added an upper limit which fits in 1 byte to
> reduce cache pressure while iterating the set. This can probably be
> bumped to a short.
>
> Anyhow.
>
> FreeBSD libc ships with an SSE2 variant of memcmp (written by someone
> else). SSE2 being part of the AMD64 ISA means it can be used as is, no
> ifunc.
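[To illustrate the above: this is not the FreeBSD implementation, just a
minimal sketch of the general SSE2 technique. The function name is
hypothetical. Since SSE2 is baseline amd64, such a routine needs no CPU
feature detection.]

```c
/*
 * Sketch of an SSE2 memcmp: compare 16 bytes per iteration, fall back
 * to a scalar loop for the tail. Real implementations handle the tail
 * with overlapping loads instead of a byte loop.
 */
#include <emmintrin.h>
#include <stddef.h>

static int
memcmp_sse2_sketch(const void *s1, const void *s2, size_t n)
{
	const unsigned char *p1 = s1, *p2 = s2;

	while (n >= 16) {
		__m128i a = _mm_loadu_si128((const __m128i *)p1);
		__m128i b = _mm_loadu_si128((const __m128i *)p2);
		/* mask is 0xffff iff all 16 bytes are equal. */
		unsigned int mask =
		    _mm_movemask_epi8(_mm_cmpeq_epi8(a, b));
		if (mask != 0xffff) {
			/* Index of the first differing byte. */
			int i = __builtin_ctz(~mask);
			return (p1[i] - p2[i]);
		}
		p1 += 16;
		p2 += 16;
		n -= 16;
	}
	/* Scalar tail. */
	for (; n != 0; n--, p1++, p2++) {
		if (*p1 != *p2)
			return (*p1 - *p2);
	}
	return (0);
}
```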
>
> Shipped in this thread is a SIMD-less variant written by me specifically
> for kernel usage.
>
> What can be done for the kernel for nice speed-ups is porting over the
> MEMMOVE macro and its associated uses. See:
> https://cgit.freebsd.org/src/tree/sys/amd64/amd64/support.S#n271
>
> This includes memcpy, memmove, copyin, copyout.
>
> Apart from that, there are patched-up memset and copystr/copyinstr.
>
> An important note for libc is that memcpy/memmove/bcopy and memset/bzero
> remain SIMD-less and thus can be imported with zero hassle; in particular
> they are safe to use from the loader itself.
>
> These are among the most commonly used routines as well, so I would say
> importing them right now is an easy call. Worst case, someone can add
> something faster later.
>
> Routines which do use SIMD may or may not require stubs for the loader.

Of course I forgot to add some results.

Trusty Haswell:

$ sh bench-memcmp.sh sizes_4_1000000
memcmp-openbsd-kernel   average:26762992
memcmp-netbsd-kernel    average:93858975
memcmp-freebsd-kernel   average:106406634
memcmp-freebsd-libc     average:140033561

$ sh bench-memcmp.sh sizes_8_1000000
memcmp-openbsd-kernel   average:25240628
memcmp-netbsd-kernel    average:74746023
memcmp-freebsd-kernel   average:97592034
memcmp-freebsd-libc     average:151985334

Not so trusty Sapphire Rapids:

$ sh bench-memcmp.sh sizes_8_1000000
memcmp-openbsd-kernel   average:39259794
memcmp-netbsd-kernel    average:88625505
memcmp-freebsd-kernel   average:120837995
memcmp-freebsd-libc     average:300220276

$ sh bench-memcmp.sh sizes_255_1000000
memcmp-openbsd-kernel   average:29169445
memcmp-netbsd-kernel    average:42962946
memcmp-freebsd-kernel   average:57501612
memcmp-freebsd-libc     average:84460824

--
Mateusz Guzik
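[Reader's note: for anyone wanting to reproduce the C-vs-assembly
experiment from earlier in the thread, the C side is essentially a plain
byte loop like the sketch below. The name c_memcmp mirrors the gprof
output above; whether the compiler vectorizes it depends on the compiler
version and flags, so treat the SIMD claim as something to verify with
-O2 and an inspection of the generated assembly.]

```c
/*
 * A straightforward C memcmp: compare byte by byte, return the
 * difference of the first mismatching pair as unsigned chars
 * (standard memcmp semantics), 0 if the buffers are equal.
 */
#include <stddef.h>

int
c_memcmp(const void *s1, const void *s2, size_t n)
{
	const unsigned char *p1 = s1, *p2 = s2;

	for (; n != 0; n--, p1++, p2++) {
		if (*p1 != *p2)
			return (*p1 - *p2);
	}
	return (0);
}
```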