From: Jeremie Courreges-Anglas <jca@wxcvbn.org>
Subject: Re: improve spinning in mtx_enter
To: Mark Kettenis <mark.kettenis@xs4all.nl>
Cc: mpi@openbsd.org, alexander.bluhm@gmx.net, mjguzik@gmail.com, tech@openbsd.org
Date: Sun, 24 Mar 2024 13:23:39 +0100

On Sun, Mar 24, 2024 at 12:39:28PM +0100, Mark Kettenis wrote:
> > Date: Sun, 24 Mar 2024 11:53:14 +0100
> > From: Mark Kettenis <mark.kettenis@xs4all.nl>
> > 
> > > Date: Sun, 24 Mar 2024 11:09:12 +0100
> > > From: Martin Pieuchot <mpi@openbsd.org>
> > 
> > Alexander, your mail still (again?) falls into the spam trap.  So I
> > didn't see your mail.
> > 
> > > On 21/03/24(Thu) 13:22, Alexander Bluhm wrote:
> > > > On Wed, Mar 20, 2024 at 01:43:18PM +0100, Mateusz Guzik wrote:
> > > > > > A few years back I noted that the migration from MD mutex code to the
> > > > > > MI implementation happened to regress performance, at least on amd64.
> > > > > 
> > > > > The MI implementation fails to check if the lock is free before issuing
> > > > > CAS, and only spins once before attempts. This avoidably reduces
> > > > > performance.
> > > > > 
> > > > > > While more can be done here, at a bare minimum the code needs a NULL
> > > > > > check, which I implemented below.
> > > > > 
> > > > > Results on 7.5 (taken from snapshots) timing make -ss -j 16 in the
> > > > > kernel dir:
> > > > > 
> > > > > before: 521.37s user 524.69s system 1080% cpu 1:36.79 total
> > > > > after:  522.76s user 486.87s system 1088% cpu 1:32.79 total
> > > > > 
> > > > > That is about 4% reduction in total real time for a rather trivial
> > > > > change.
> > > > 
> > > > I refactored the diff a bit to keep the while loop.
> > > > 
> > > > With that I get a 5% reduction in sys time during kernel build on an
> > > > 8 CPU machine.  It has two sockets with 4 cores each.
> > > > 
> > > > Flame graphs of kernel build are here:
> > > > http://bluhm.genua.de/perform/results/2024-03-19T23:41:35Z/2024-03-19T00%3A00%3A00Z/btrace/time_-lp_make_-CGENERIC.MP_-j8_-s-btrace-kstack.0.svg
> > > > http://bluhm.genua.de/perform/results/2024-03-19T23:41:35Z/patch-sys-mutex-spin.1/btrace/time_-lp_make_-CGENERIC.MP_-j8_-s-btrace-kstack.0.svg
> > > > 
> > > > If you zoom into kernel and search for mtx_enter, you see that
> > > > spinning time goes down from 13.1% to 10.7%
> > 
> > I assume that's an amd64 machine?  Can we test this on some other
> > architectures too before we commit it?  I don't expect a regression,
> > but it would be interesting to see what the effect is on an
> > architecture that has better atomic operations than amd64.
> 
> FWIW, I see no measurable difference on a kernel build on my 10-core
> M2 Pro Mac mini with this diff.  (And that is fine).

Same on my single-socket 8-core amd64 builder.
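
For anyone following along, the "check that the lock is free before
issuing the CAS" idea from the original mail can be sketched roughly as
below.  This is a minimal userland illustration with C11 atomics, not the
diff under discussion: struct sketch_lock and the sketch_* functions are
hypothetical names, and the real mtx_enter() also deals with interrupt
priority levels and a spin hint, which are left out here.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical lock type for illustration; not the OpenBSD struct mutex. */
struct sketch_lock {
	_Atomic(void *) owner;	/* NULL when the lock is free */
};

/* One acquisition attempt: returns 1 on success, 0 if the lock is held. */
static int
sketch_try_enter(struct sketch_lock *l, void *self)
{
	void *expected = NULL;

	/* Cheap read first: skip the CAS while someone else owns the lock. */
	if (atomic_load_explicit(&l->owner, memory_order_relaxed) != NULL)
		return 0;

	/* The lock looked free, so now it is worth paying for the CAS. */
	return atomic_compare_exchange_strong_explicit(&l->owner, &expected,
	    self, memory_order_acquire, memory_order_relaxed);
}

/* Spin until the lock is acquired. */
static void
sketch_enter(struct sketch_lock *l, void *self)
{
	while (!sketch_try_enter(l, self))
		;	/* a kernel would issue a CPU relax/pause hint here */
}

/* Release the lock by clearing the owner. */
static void
sketch_leave(struct sketch_lock *l)
{
	atomic_store_explicit(&l->owner, NULL, memory_order_release);
}
```

The point of the plain load is that waiters mostly read a shared cache
line instead of repeatedly requesting it exclusively for a CAS that is
going to fail anyway.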

-- 
jca