From: Mateusz Guzik
Subject: Re: improve spinning in mtx_enter
To: Alexander Bluhm
Cc: tech@openbsd.org
Date: Thu, 21 Mar 2024 13:36:25 +0100

On Thu, Mar 21, 2024 at 1:22 PM Alexander Bluhm wrote:
>
> On Wed, Mar 20, 2024 at 01:43:18PM +0100, Mateusz Guzik wrote:
> > A few years back I noted that the migration from MD mutex code to the
> > MI implementation happened to regress performance at least on amd64.
> >
> > The MI implementation fails to check if the lock is free before issuing
> > CAS, and only spins once before attempts. This avoidably reduces
> > performance.
> >
> > While more can be done here, at bare minimum the code needs a NULL
> > check, which I implemented below.
> >
> > Results on 7.5 (taken from snapshots) timing make -ss -j 16 in the
> > kernel dir:
> >
> > before: 521.37s user 524.69s system 1080% cpu 1:36.79 total
> > after:  522.76s user 486.87s system 1088% cpu 1:32.79 total
> >
> > That is about a 4% reduction in total real time for a rather trivial
> > change.
>
> I refactored the diff a bit to keep the while loop.
>
> With that I get a 5% reduction in sys time during kernel build on an
> 8 CPU machine. It has two sockets with 4 cores each.
>
> Flame graphs of the kernel build are here:
> http://bluhm.genua.de/perform/results/2024-03-19T23:41:35Z/2024-03-19T00%3A00%3A00Z/btrace/time_-lp_make_-CGENERIC.MP_-j8_-s-btrace-kstack.0.svg
> http://bluhm.genua.de/perform/results/2024-03-19T23:41:35Z/patch-sys-mutex-spin.1/btrace/time_-lp_make_-CGENERIC.MP_-j8_-s-btrace-kstack.0.svg
>
> If you zoom into the kernel and search for mtx_enter, you see that
> spinning time goes down from 13.1% to 10.7%.
>
> Should we commit this?
>
> bluhm
>
> Index: kern/kern_lock.c
> ===================================================================
> RCS file: /data/mirror/openbsd/cvs/src/sys/kern/kern_lock.c,v
> diff -u -p -r1.72 kern_lock.c
> --- kern/kern_lock.c	26 Apr 2022 15:31:14 -0000	1.72
> +++ kern/kern_lock.c	21 Mar 2024 12:11:20 -0000
> @@ -264,15 +264,17 @@ mtx_enter(struct mutex *mtx)
>
>  	spc->spc_spinning++;
>  	while (mtx_enter_try(mtx) == 0) {
> -		CPU_BUSY_CYCLE();
> -
> +		do {
> +			CPU_BUSY_CYCLE();
>  #ifdef MP_LOCKDEBUG
> -		if (--nticks == 0) {
> -			db_printf("%s: %p lock spun out\n", __func__, mtx);
> -			db_enter();
> -			nticks = __mp_lock_spinout;
> -		}
> +			if (--nticks == 0) {
> +				db_printf("%s: %p lock spun out\n",
> +				    __func__, mtx);
> +				db_enter();
> +				nticks = __mp_lock_spinout;
> +			}
>  #endif
> +		} while (mtx->mtx_owner != NULL);
>  	}
>  	spc->spc_spinning--;
>  }

I also got about 5% in terms of system time.

I'm not going to argue about the style, but I'm going to note some
extra stuff. :)

Two speedups on top of this would come from:
1. Reducing lock accesses in mtx_enter_try to begin with. Most notably,
the IPL level could be read only once, not every single time execution
lands there.
2. Backoff could be added; even something as simple as exponentially
backing off up to 8 spins would be an improvement with more CPUs.

However, the vast majority of the win is already there with the initial
patch.

That aside, I think there is avoidable overcounting of spinning --
even the initial attempt at taking the lock is enclosed by
spc_spinning. I would suggest prefixing the code with a mere
mtx_enter_try. Should that fail, drop down to a do { } while loop
which spins first and tries to lock later.

--
Mateusz Guzik
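
A minimal sketch of the backoff idea in point 2 above, using only the
mtx_enter_try() and CPU_BUSY_CYCLE() interfaces visible in the quoted
diff; the function name, the cap of 8, and the doubling policy are
illustrative assumptions, not a tested patch, and the MP_LOCKDEBUG
spinout check is omitted for brevity:

	/*
	 * Sketch only: double the busy-wait between failed lock
	 * attempts, capped at 8 cycles.  Name is hypothetical.
	 */
	void
	mtx_enter_backoff_sketch(struct mutex *mtx)
	{
		unsigned int i, spins = 1;

		while (mtx_enter_try(mtx) == 0) {
			/* Busy-wait longer after each failed attempt. */
			for (i = 0; i < spins; i++)
				CPU_BUSY_CYCLE();
			if (spins < 8)
				spins <<= 1;	/* 1, 2, 4, 8 cycles */
		}
	}

Growing the wait between attempts reduces cache line traffic from
repeated failed CAS operations on contended locks, which is where the
benefit with more CPUs would come from.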
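
For the spc_spinning overcounting point, a sketch of the suggested
shape: one plain mtx_enter_try() before any spin accounting, then a
do { } while loop that spins until the owner field looks clear and only
then retries.  It assumes spc comes from curcpu() as in the existing
mtx_enter(); the function name is hypothetical and MP_LOCKDEBUG
handling is again omitted:

	void
	mtx_enter_sketch(struct mutex *mtx)
	{
		struct schedstate_percpu *spc = &curcpu()->ci_schedstate;

		/* Fast path: uncontended acquire, no spinning accounted. */
		if (mtx_enter_try(mtx))
			return;

		spc->spc_spinning++;
		do {
			/* Spin until the lock looks free, then retry CAS. */
			do {
				CPU_BUSY_CYCLE();
			} while (mtx->mtx_owner != NULL);
		} while (mtx_enter_try(mtx) == 0);
		spc->spc_spinning--;
	}

Here the uncontended acquire never touches spc_spinning, and the inner
loop waits for mtx_owner to clear before retrying, matching the intent
of the quoted diff.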