Download raw body.
[PATCH] amd64: import optimized memcmp from FreeBSD
On Mon, Dec 02, 2024 at 02:21:09AM +0000, Christian Schulte wrote: > On 11/30/24 15:59, Christian Weisgerber wrote: > > Christian Schulte: > > > >> lately I thought about rewriting those functions to use simd, but never > >> looked further into doing it, because I do not know if simd > >> registers/instructions could be used there at all. Could they? Instead > > > > Presumably not in the kernel as we would have to save/restore > > FPU/SIMD registers around those calls. > > > >> of working on the same thing every now and then, I would be in favour of > >> providing routines using the most performant instructions the > >> architecture has to offer (simd), if possible, and be done with it. > >> Would not mind writing those, if they would be accepted. > > > > FreeBSD already did all that work recently. > > https://freebsdfoundation.org/blog/a-sneak-peek-simd-enhanced-string-functions-for-amd64/ > > > > Do you know if they have published some test data somewhere? Something > which can be used to compare results. I got curious about it (again) and > instead of directly jumping onto the assembly train, I first wanted to > find out what the compiler will produce when writing a memcmp function > in C. Compared to the current memcmp in base a C memcmp compiled with > -O2 will make the compiler use SIMD instructions. I am only testing it > when equal so that the whole length has to be scanned. > > I tested various len values (0,1,2,3,4,5,6,7,8...). Either I am a total > idiot or the C version really outperforms the assembly version. That's > for a len == 517. > > % cumulative self self total > time seconds seconds calls ms/call ms/call name > 72.1 220.33 220.33 1000000000 0.00 0.00 _libc_memcmp [3] > 17.9 274.98 54.65 1000000000 0.00 0.00 c_memcmp [4] > > And that's for a len == 13. > > 23.6 20.67 9.04 1000000000 0.00 0.00 _libc_memcmp [4] > 20.9 28.67 8.00 1000000000 0.00 0.00 c_memcmp [5] > > This can't hardly be, can it? > As I already mentioned the current asm routines are of very poor quality and one would have to try to not do better. However, I'm deeply confused what exactly have you tested above. OpenBSD libc for the amd64 platform uses a C variant of the memcmp routine (as opposed to the kernel which has the problematic rep variant) The compiler did not emit SIMD for it in libc. Compiling the lib/libc/string/memcmp.c file by hand with cc -O2 -c gives the same assembly assembly as libc. I'm trying this out on OpenBSD 7.6 with clang 16.0.6. Anyhow yes, the compiler *can* emit some simd here and it even got extensions to support overlapping stores. Personally I benchmarked with this: https://people.freebsd.org/~mjg/.junk/will-it-scale-memcmp.tgz $ sh compile-memcmp.sh Best used on FreeBSD or Linux due to CPU pinning. If you insist on OpenBSD you will need to replace main.c with: https://people.freebsd.org/~mjg/.junk/64/main.c and whack -lhwloc lines bench with random sizes up to 8 bytes, 1 mln samples: $ sh bench-memcmp.sh sizes_8_1000000 The archive ships with other seeds. More can be generated with gen-sizes.c My work was concerning the kernel variant, where passed sizes very heavily skew to < 8 and I added an upper a limit which fits in 1 byte to reduce cache pressure while iterating the set. This probably can be bumped to short. Anyhow. FreeBSD libc ships with a SSE2 variant of memcmp (written by someone else). SSE2 being part of AMD64 ISA means it can be used as is, no ifunc. Shipped in this thread is a SIMD-less variant written by me specifically for kernel usage. What can be done for the kernel for nice speed ups is porting over the MEMMOVE macro and its associated uses. See: https://cgit.freebsd.org/src/tree/sys/amd64/amd64/support.S#n271 This includes memcpy, memmove, copyin, copyout. Apart from that there is patched up memset and copystr/copyinstr. Important note for libc is that memcpy/memmove/bcopy and memset/bzero remain simdless and thus can be imported with 0 hassle, in particular they are safe to use from the loader itself. These are one of the most commonly used as well, so I would say importing right now is an easy call. Worst case someone can add something faster later. Routines which do use SIMD may or may not require stubs for the loader.
[PATCH] amd64: import optimized memcmp from FreeBSD