[PATCH] amd64: import optimized memcmp from FreeBSD
On Fri, Nov 29, 2024 at 11:08 PM Mark Kettenis <mark.kettenis@xs4all.nl> wrote:
>
> > From: Mateusz Guzik <mjguzik@gmail.com>
> > Date: Fri, 29 Nov 2024 02:01:45 +0100
> >
> > The rep-prefixed cmps is incredibly slow even on modern CPUs.
> >
> > The new implementation uses regular cmp to do it.
> >
> > The code got augmented to account for retguard; otherwise it matches FreeBSD.
> >
> > On Sapphire Rapids open1 (open + close in a loop) from will-it-scale (ops/s):
> > before: 436177
> > after: 464670 (+6.5%)
>
> That is a nice speedup, but you'd need to present more than a single
> microbenchmark on a single microarchitecture to convince me that this
> is worth the extra complication of the assembly code.
>
I would agree if it were not for the fact that this is a core string
primitive, not some one-off use somewhere. There is no excuse for any
of these to straight up suck, which unfortunately they *ALL* do --
the whole set (memcpy, bcopy, memcmp, memset, copyinstr and so on)
on amd64 was slapped together years ago in a way that severely
hindered performance even on CPUs relevant for the time period they
showed up in.
One could wonder whether bugs were introduced here, and that is of
course a possibility, but the code has been in use for ~5 years now,
so I wager it works fine.
What one could question is whether the routine is optimal. Here I can
say it avoids all the traps I know of, and it has reached the point
where everything is a tradeoff -- you don't get to speed up one case
without slowing another down. It is, however, a net win across the
board compared to the stock routine.
I did a major rewrite of these routines in FreeBSD years ago and got
nice wins for it; all of them are up for grabs for interested parties.
The key culprit is the rep prefix, which suffers significant startup
latency, making it completely unsuitable for small sizes (the cutoff
depends on the uarch, but it is at least < 64 bytes, which covers the
vast majority of real-world lengths). Some CPUs claim to have fast
short rep for some of the instructions, but even those are very new
and don't have it for everything.
Another fun one is the use of lodsb/stosb in copyinstr, which suffers
completely avoidable latency issues and can be trivially replaced.
Sample bench result on that one from years back: a simple test on an
Intel(R) Core(TM) i7-4600U CPU @ 2.10GHz copying /foo/bar/baz in a
loop goes from 295715863 ops/s to 465807408 ops/s.
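For reference, a minimal C sketch of the kind of loop that replaces
lodsb/stosb: plain byte loads and stores that the compiler turns into
ordinary mov instructions, with none of the string-instruction
latency. The function name and error handling are made up for
illustration; the real copyinstr also has to handle user-space faults,
which this sketch ignores.

#include <stddef.h>
#include <errno.h>

/* Illustrative only: copy a NUL-terminated string without lodsb/stosb. */
static int
copy_string_plain(char *dst, const char *src, size_t maxlen, size_t *lencopied)
{
	size_t i;

	for (i = 0; i < maxlen; i++) {
		char c = src[i];	/* plain load, compiles to a mov */
		dst[i] = c;		/* plain store, compiles to a mov */
		if (c == '\0') {
			if (lencopied != NULL)
				*lencopied = i + 1;
			return 0;
		}
	}
	return ENAMETOOLONG;
}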
Anyhow, I would encourage interested parties to grab the rest; I'm not
going to do any more work here.