
From: Mateusz Guzik <mjguzik@gmail.com>
Subject: Re: [PATCH] amd64: import optimized memcmp from FreeBSD
To: Mark Kettenis <mark.kettenis@xs4all.nl>
Cc: tech@openbsd.org
Date: Sat, 30 Nov 2024 04:03:47 +0100

On Fri, Nov 29, 2024 at 11:08 PM Mark Kettenis <mark.kettenis@xs4all.nl> wrote:
>
> > From: Mateusz Guzik <mjguzik@gmail.com>
> > Date: Fri, 29 Nov 2024 02:01:45 +0100
> >
> > The rep-prefixed cmps is incredibly slow even on modern CPUs.
> >
> > The new implementation uses regular cmp to do it.
> >
> > The code got augmented to account for retguard, otherwise it matches FreeBSD.
> >
> > On Sapphire Rapids open1 (open + close in a loop) from will-it-scale (ops/s):
> > before:       436177
> > after:        464670 (+6.5%)
>
> That is a nice speedup, but you'd need to present more than a single
> microbenchmark on a single microarchitecture to convince me that this
> is worth the extra complication of the assembly code.
>
    
I would agree if it were not for the fact that this is a core string
primitive, not some one-off used in a corner somewhere. There is no
excuse for any of these to straight up suck, which unfortunately they
*ALL* do -- the whole set (memcpy, bcopy, memcmp, memset, copyinstr
and so on) on amd64 was slapped together years ago in a way which
severely hindered performance even on the CPUs that were current when
the code landed.
    
One could of course wonder whether bugs were introduced here, and that
is always a possibility, but the code has been in use in FreeBSD for
~5 years now, so I wager it works fine.
    
What one could question is whether the routine is optimal. Here I can
say it avoids all the traps I know of and has reached the stage where
it is a tradeoff city -- you don't get to speed anything up without
slowing something else down. It is, however, a net win across the
board compared to the stock routine.
    
I did a major rewrite of these routines years ago and got nice wins
for it in FreeBSD; all of them are up for grabs for interested parties.
    
The key culprit is the rep prefix, which suffers significant startup
latency, making it completely unsuitable for small sizes (the cutoff
depends on the uarch, but it is at least < 64 bytes, which covers the
vast majority of real-world lengths). Some CPUs claim to have fast
short rep support for some of the instructions, but even those are
very new and don't have it for everything.
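
To make this concrete, here is a rough sketch of the two approaches
(mine, for illustration only -- not the FreeBSD code, which also
handles unaligned buffers and tails). It assumes the SysV amd64 ABI
(s1 in %rdi, s2 in %rsi, n in %rdx) and n being a multiple of 8;
the labels are made up:

    /* Roughly the old approach: rep pays its startup latency every call. */
            movq    %rdx,%rcx
            shrq    $3,%rcx         /* byte count -> qword count */
            repe    cmpsq           /* microcoded compare loop */
            /* ... then turn the flags into a return value ... */

    /* The rep-free approach: plain loads and cmp. */
            xorl    %eax,%eax       /* assume equal */
            testq   %rdx,%rdx
            jz      3f
    1:      movq    (%rdi),%rcx
            cmpq    (%rsi),%rcx
            jne     2f              /* found a differing qword */
            addq    $8,%rdi
            addq    $8,%rsi
            subq    $8,%rdx
            jnz     1b
    3:      ret
    2:      movq    (%rsi),%rdx     /* count is dead, reuse the register */
            bswapq  %rcx            /* first differing memory byte becomes */
            bswapq  %rdx            /* the most significant one */
            cmpq    %rdx,%rcx
            sbbl    %eax,%eax       /* %eax = -1 if s1 < s2, else 0 */
            orl     $1,%eax         /* fold to -1 or 1 */
            ret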
    
Another funzy is the use of lodsb/stosb in copyinstr, which suffers
completely avoidable latency issues and can be trivially replaced
(see the sketch below).
    
Sample bench result on that one from years back:
    A simple test on an Intel(R) Core(TM) i7-4600U CPU @ 2.10GHz copying
    /foo/bar/baz in a loop goes from 295715863 ops/s to 465807408 ops/s.
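
The pattern and its trivial replacement look roughly like this (a
sketch of the general idiom, not the actual copyinstr code, which
also has to bound the copy and handle faults):

    /* String instructions: lodsb/stosb carry avoidable extra latency. */
    1:      lodsb                   /* %al = (%rsi); %rsi++ */
            stosb                   /* (%rdi) = %al; %rdi++ */
            testb   %al,%al         /* stop after copying the NUL */
            jnz     1b

    /* Plain moves: same semantics, cheap on every uarch. */
    1:      movb    (%rsi),%al
            movb    %al,(%rdi)
            incq    %rsi
            incq    %rdi
            testb   %al,%al
            jnz     1b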
    
Anyhow, I would encourage interested parties to grab the rest; I'm not
going to do any more work here.
    
    
    