From: Otto Moerbeek <otto@drijf.net>
Subject: Re: [PATCH] amd64: import optimized memcmp from FreeBSD
To: Christian Schulte <cs@schulte.it>
Cc: tech@openbsd.org
Date: Mon, 2 Dec 2024 07:54:53 +0100

On Mon, Dec 02, 2024 at 03:21:09AM +0100, Christian Schulte wrote:

> On 11/30/24 15:59, Christian Weisgerber wrote:
> > Christian Schulte:
> > 
> >> lately I thought about rewriting those functions to use simd, but never
> >> looked further into doing it, because I do not know if simd
> >> registers/instructions could be used there at all. Could they? Instead
> > 
> > Presumably not in the kernel as we would have to save/restore
> > FPU/SIMD registers around those calls.
> > 
> >> of working on the same thing every now and then, I would be in favour of
> >> providing routines using the most performant instructions the
> >> architecture has to offer (simd), if possible, and be done with it.
> >> Would not mind writing those, if they would be accepted.
> > 
> > FreeBSD already did all that work recently.
> > https://freebsdfoundation.org/blog/a-sneak-peek-simd-enhanced-string-functions-for-amd64/
> > 
> 
> Do you know if they have published some test data somewhere? Something
> which can be used to compare results. I got curious about it (again) and
> instead of directly jumping onto the assembly train, I first wanted to
> find out what the compiler will produce when writing a memcmp function
> in C. Compared to the current memcmp in base a C memcmp compiled with
> -O2 will make the compiler use SIMD instructions. I am only testing it
> when equal so that the whole length has to be scanned.
> 
> I tested various len values (0,1,2,3,4,5,6,7,8...). Either I am a total
> idiot or the C version really outperforms the assembly version. That's
> for a len == 517.
> 
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls  ms/call  ms/call  name
>  72.1     220.33   220.33 1000000000     0.00     0.00  _libc_memcmp [3]
>  17.9     274.98    54.65 1000000000     0.00     0.00  c_memcmp [4]
> 
> And that's for a len == 13.
> 
>  23.6      20.67     9.04 1000000000     0.00     0.00  _libc_memcmp [4]
>  20.9      28.67     8.00 1000000000     0.00     0.00  c_memcmp [5]
> 
> This can't hardly be, can it?

I see longer times for c_memcmp, so it looks slower. _libc_memcmp is
faster for both cases.

Be very wary when doing speed tests of specific functions using
profiling: the overhead and bias of the profiler can be big. Also,
aggressive optimization and inlining can skew the results.

	-Otto