[PATCH] amd64: import optimized memcmp from FreeBSD
On Fri, Nov 29, 2024 at 11:08 PM Mark Kettenis <mark.kettenis@xs4all.nl> wrote:
>
> > From: Mateusz Guzik <mjguzik@gmail.com>
> > Date: Fri, 29 Nov 2024 02:01:45 +0100
> >
> > The rep-prefixed cmps is incredibly slow even on modern CPUs.
> >
> > The new implementation uses regular cmp to do it.
> >
> > The code got augmented to account for retguard; otherwise it matches FreeBSD.
> >
> > On Sapphire Rapids open1 (open + close in a loop) from will-it-scale (ops/s):
> > before: 436177
> > after: 464670 (+6.5%)
>
> That is a nice speedup, but you'd need to present more than a single
> microbenchmark on a single microarchitecture to convince me that this
> is worth the extra complication of the assembly code.
>
I would agree if it were not for the fact that this is a core string
primitive, not some one-off use somewhere. There is no excuse for any
of these to straight up suck, which unfortunately they *ALL* do --
the whole set (memcpy, bcopy, memcmp, memset, copyinstr and so on)
on amd64 was slapped together years ago in a way that severely
hindered performance even on CPUs relevant for the time period they
showed up in.
One could wonder whether bugs were introduced here, and that is of
course a possibility, but the code has been in use for ~5 years now,
so I wager it works fine.
What one could question is whether the routine is optimal. Here I can
say it avoids all the traps I know of, and it has reached the point
where everything is a tradeoff -- you don't get to speed up one case
without slowing another down. It is, however, a net win across the
board compared to the stock routine.
I did a major rewrite of these routines in FreeBSD years ago and got
nice wins for it; all of them are up for grabs for interested parties.
The key culprit is the rep prefix, which suffers significant startup
latency, making it completely unsuitable for small sizes (the cutoff
depends on the uarch, but it is at least < 64 bytes, which covers the
vast majority of real-world lengths). Some CPUs claim to have fast
short rep for some of the instructions, but even those are very new
and don't have it for everything.
Another fun one is the use of lodsb/stosb in copyinstr, which suffers
completely avoidable latency issues and can be trivially replaced.
Sample bench result on that one from years back: a simple test on an
Intel(R) Core(TM) i7-4600U CPU @ 2.10GHz copying /foo/bar/baz in a
loop goes from 295715863 ops/s to 465807408 ops/s.
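For reference, a minimal C sketch of the kind of loop that replaces
lodsb/stosb: plain byte loads and stores that the compiler turns into
ordinary mov instructions, with none of the string-instruction
latency. The function name and error handling are made up for
illustration; the real copyinstr also has to handle user-space faults,
which this sketch ignores.

#include <stddef.h>
#include <errno.h>

/* Illustrative only: copy a NUL-terminated string without lodsb/stosb. */
static int
copy_string_plain(char *dst, const char *src, size_t maxlen, size_t *lencopied)
{
	size_t i;

	for (i = 0; i < maxlen; i++) {
		char c = src[i];	/* plain load, compiles to a mov */
		dst[i] = c;		/* plain store, compiles to a mov */
		if (c == '\0') {
			if (lencopied != NULL)
				*lencopied = i + 1;
			return 0;
		}
	}
	return ENAMETOOLONG;
}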
Anyhow, I would encourage interested parties to grab the rest; I'm not
going to do any more work here.