[PATCH] amd64: import optimized memcmp from FreeBSD
On Mon, Dec 02, 2024 at 03:21:09AM +0100, Christian Schulte wrote:

> On 11/30/24 15:59, Christian Weisgerber wrote:
> > Christian Schulte:
> >
> >> lately I thought about rewriting those functions to use simd, but never
> >> looked further into doing it, because I do not know if simd
> >> registers/instructions could be used there at all. Could they? Instead
> >
> > Presumably not in the kernel as we would have to save/restore
> > FPU/SIMD registers around those calls.
> >
> >> of working on the same thing every now and then, I would be in favour of
> >> providing routines using the most performant instructions the
> >> architecture has to offer (simd), if possible, and be done with it.
> >> Would not mind writing those, if they would be accepted.
> >
> > FreeBSD already did all that work recently.
> > https://freebsdfoundation.org/blog/a-sneak-peek-simd-enhanced-string-functions-for-amd64/
> >
>
> Do you know if they have published some test data somewhere? Something
> which can be used to compare results. I got curious about it (again) and
> instead of directly jumping onto the assembly train, I first wanted to
> find out what the compiler will produce when writing a memcmp function
> in C. Compared to the current memcmp in base a C memcmp compiled with
> -O2 will make the compiler use SIMD instructions. I am only testing it
> when equal so that the whole length has to be scanned.
>
> I tested various len values (0,1,2,3,4,5,6,7,8...). Either I am a total
> idiot or the C version really outperforms the assembly version. That's
> for a len == 517.
>
>   %   cumulative   self                self     total
>  time   seconds   seconds      calls  ms/call  ms/call  name
>  72.1     220.33   220.33 1000000000     0.00     0.00  _libc_memcmp [3]
>  17.9     274.98    54.65 1000000000     0.00     0.00  c_memcmp [4]
>
> And that's for a len == 13.
>
>  23.6      20.67     9.04 1000000000     0.00     0.00  _libc_memcmp [4]
>  20.9      28.67     8.00 1000000000     0.00     0.00  c_memcmp [5]
>
> This can't hardly be, can it?

I see longer times for c_memcmp, so it looks slower.
_libc_memcmp is faster for both cases.

Be very wary when doing speed tests of specific functions using
profiling: the overhead and bias can be big. Also, aggressive
optimization and inlining can skew the results.

	-Otto
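
To illustrate the caution about profiling overhead and inlining, here is a minimal sketch of a timing helper that avoids gprof entirely. It is not the test code from this thread, which was not posted. The byte-wise c_memcmp and the time_cmp helper are hypothetical names invented for this example, and the approach (wall-clock timing with clock_gettime, calling through a volatile function pointer so the compiler cannot inline or drop the call) is only one assumed way to reduce the bias described above.

```c
#define _POSIX_C_SOURCE 200809L	/* for clock_gettime() on some systems */

#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical plain C memcmp, standing in for the c_memcmp mentioned
 * in the thread. Compares byte by byte as unsigned char.
 */
static int
c_memcmp(const void *s1, const void *s2, size_t n)
{
	const unsigned char *p1 = s1, *p2 = s2;

	for (; n != 0; n--, p1++, p2++) {
		if (*p1 != *p2)
			return *p1 - *p2;
	}
	return 0;
}

typedef int (*cmp_fn)(const void *, const void *, size_t);

/*
 * Time `iters` calls of `fn` and return the elapsed wall-clock seconds.
 * The function pointer is volatile, so every call is a real indirect
 * call, and the result is accumulated into a volatile sink so the
 * calls cannot be optimized away.
 */
static double
time_cmp(cmp_fn fn, const void *a, const void *b, size_t len, long iters)
{
	cmp_fn volatile f = fn;
	volatile int sink = 0;
	struct timespec t0, t1;
	long i;

	assert(fn != NULL && iters > 0);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; i++)
		sink |= f(a, b, len);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	(void)sink;

	return (double)(t1.tv_sec - t0.tv_sec) +
	    (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling time_cmp(memcmp, a, b, 517, iters) and time_cmp(c_memcmp, a, b, 517, iters) on two equal buffers gives per-function wall-clock totals that can be compared directly. The indirect call adds the same small cost to both sides, so it does not favour either implementation, but results will still vary with compiler flags, buffer alignment and CPU.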