From: David Gwynne
Subject: Re: mtx_enter() fix for huge systems
To: Mark Kettenis
Cc: Martin Pieuchot, tech@openbsd.org
Date: Tue, 20 May 2025 08:40:12 +1000

On Mon, May 19, 2025 at 10:55:21PM +0200, Mark Kettenis wrote:
> > Date: Mon, 19 May 2025 16:28:31 +0200
> > From: Martin Pieuchot
> 
> Hi Martin,
> 
> > Last May SCHED_LOCK() was converted to a mutex. This change exposed,
> > on big arm64 machines like the Ampere Altra, a flaw in the MI mutex
> > (spinlock). Then in July the weakness was partially fixed by using
> > ARMv8.1 LSE instructions such as CAS. Sadly the issue is still
> > present and can hang the machine as soon as a given mutex is too
> > contended.
> > 
> > The diff below contains two changes to mitigate cache line contention,
> > and with it my 80-CPU Ampere Altra is now stable:
> > 
> > - Always read the value just before issuing an atomic operation. It
> >   might have changed during splraise() and we want to avoid locking
> >   the cache line. A similar change was made recently in rwlocks.
> 
> I think that's a no-brainer. It is good practice in general and I'm
> frankly surprised we didn't do so already.
> 
> > - Change the busy loop used when waiting for a contended lock. Instead
> >   of wasting a single "busy cycle" in each iteration, increase the
> >   number of cycles exponentially. The min & max numbers for this
> >   exponential backoff have been chosen carefully to not waste more
> >   cycles than in -current. I even observed a small reduction of %sys
> >   time on my amd64 with 24 CPUs.
> 
> This is another good idea. But I'm wondering to what extent using the
> same max number across architectures (and micro-architectures) makes
> sense. On some architectures CPU_BUSY_CYCLE() is a NOP. On others it
> is some sort of hint that may make the CPU stall for a small amount of
> time. But that time will vary significantly between architectures.
> Or it may end up as a NOP anyway. For example, on arm64 we use the
> YIELD instruction, but that instruction might only do something on SMT
> systems, and those don't actually exist in arm64 land.
> 
> Anyway, given the relatively small max value, we're probably ok. Was
> there a huge difference between using 32 or 64 as the max value?

it's better than the current code, and if we figure out an even better
value in the future we can change it then.

> I do measure a small improvement in system time as well on the M2 Pro
> with 10 cores.
> 
> > This diff also removes the MP_LOCKDEBUG goo for mutexes. I believe
> > that nowadays it creates more false positives than useful reports, so
> > I didn't want to adapt that code to the new busy loop. I'd argue we
> > should get rid of the option completely and build something else like
> > lockstat(8).
> 
> Yes, I don't think MP_LOCKDEBUG is very useful anymore. But maybe we
> should keep the diagnostic in __mp_unlock()?
> 
> > As Claudio mentioned earlier, I agree and would welcome an MCS lock
> > implementation for kernel mutexes. That said, the change below should
> > be enough to unstick my uvm_purge() diff and has benefits on its own.
> > 
> > ok?
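
fwiw, for anyone skimming: stripped of the IPL and WITNESS handling, the
shape of the change in the diff below is test-and-test-and-set plus a
capped exponential backoff, something like this sketch (made-up names,
no memory barriers, not the actual code):

/*
 * Sketch only: read before the atomic op, and back off exponentially
 * between attempts. Simplified names, no IPL/WITNESS/membar handling.
 */
#define SPIN_MIN_BUSY_CYCLES	1
#define SPIN_MAX_BUSY_CYCLES	64

void
spin_enter(struct cpu_info * volatile *owner, struct cpu_info *ci)
{
	unsigned int i, ncycle = SPIN_MIN_BUSY_CYCLES;

	for (;;) {
		/*
		 * Only issue the atomic op once a plain read says it
		 * can succeed, so we don't bounce the cache line around.
		 */
		if (*owner == NULL &&
		    atomic_cas_ptr(owner, NULL, ci) == NULL)
			return;

		/* Wait progressively longer before looking again. */
		for (i = ncycle; i > 0; i--)
			CPU_BUSY_CYCLE();
		if (ncycle < SPIN_MAX_BUSY_CYCLES)
			ncycle += ncycle;
	}
}

the point being that the CAS only fires when a plain load says it has a
chance of succeeding, so contended CPUs mostly share the cache line
read-only instead of fighting over exclusive ownership of it.
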
> 
> ok kettenis@

ok

> 
> > Index: kern/kern_lock.c
> > ===================================================================
> > RCS file: /cvs/src/sys/kern/kern_lock.c,v
> > diff -u -p -r1.75 kern_lock.c
> > --- kern/kern_lock.c	3 Jul 2024 01:36:50 -0000	1.75
> > +++ kern/kern_lock.c	19 May 2025 14:19:02 -0000
> > @@ -42,6 +42,21 @@ int __mp_lock_spinout = INT_MAX;
> >  struct __mp_lock kernel_lock;
> >  
> >  /*
> > + * Min & max numbers of "busy cycles" to waste before trying again to
> > + * acquire a contended lock using an atomic operation.
> > + *
> > + * The min number must be as small as possible to not introduce extra
> > + * latency. It also doesn't matter if the first steps of an exponential
> > + * backoff are small.
> > + *
> > + * The max number is used to cap the exponential backoff. It should
> > + * be small enough to not waste too many cycles in %sys time and big
> > + * enough to reduce (ideally avoid) cache line contention.
> > + */
> > +#define CPU_MIN_BUSY_CYCLES	1
> > +#define CPU_MAX_BUSY_CYCLES	64
> > +
> > +/*
> >   * Functions for manipulating the kernel_lock. We put them here
> >   * so that they show up in profiles.
> >   */
> > @@ -228,9 +243,7 @@ void
> >  mtx_enter(struct mutex *mtx)
> >  {
> >  	struct schedstate_percpu *spc = &curcpu()->ci_schedstate;
> > -#ifdef MP_LOCKDEBUG
> > -	int nticks = __mp_lock_spinout;
> > -#endif
> > +	unsigned int i, ncycle = CPU_MIN_BUSY_CYCLES;
> >  
> >  	WITNESS_CHECKORDER(MUTEX_LOCK_OBJECT(mtx),
> >  	    LOP_EXCLUSIVE | LOP_NEWORDER, NULL);
> > @@ -238,15 +251,11 @@ mtx_enter(struct mutex *mtx)
> >  	spc->spc_spinning++;
> >  	while (mtx_enter_try(mtx) == 0) {
> >  		do {
> > -			CPU_BUSY_CYCLE();
> > -#ifdef MP_LOCKDEBUG
> > -			if (--nticks == 0) {
> > -				db_printf("%s: %p lock spun out\n",
> > -				    __func__, mtx);
> > -				db_enter();
> > -				nticks = __mp_lock_spinout;
> > -			}
> > -#endif
> > +			/* Busy loop with exponential backoff. */
> > +			for (i = ncycle; i > 0; i--)
> > +				CPU_BUSY_CYCLE();
> > +			if (ncycle < CPU_MAX_BUSY_CYCLES)
> > +				ncycle += ncycle;
> >  		} while (mtx->mtx_owner != NULL);
> >  	}
> >  	spc->spc_spinning--;
> > @@ -265,20 +274,27 @@ mtx_enter_try(struct mutex *mtx)
> >  	if (mtx->mtx_wantipl != IPL_NONE)
> >  		s = splraise(mtx->mtx_wantipl);
> >  
> > -	owner = atomic_cas_ptr(&mtx->mtx_owner, NULL, ci);
> > +	/*
> > +	 * Avoid unconditional atomic operation to prevent cache line
> > +	 * contention.
> > +	 */
> > +	owner = mtx->mtx_owner;
> >  #ifdef DIAGNOSTIC
> >  	if (__predict_false(owner == ci))
> >  		panic("mtx %p: locking against myself", mtx);
> >  #endif
> >  	if (owner == NULL) {
> > -		membar_enter_after_atomic();
> > -		if (mtx->mtx_wantipl != IPL_NONE)
> > -			mtx->mtx_oldipl = s;
> > +		owner = atomic_cas_ptr(&mtx->mtx_owner, NULL, ci);
> > +		if (owner == NULL) {
> > +			membar_enter_after_atomic();
> > +			if (mtx->mtx_wantipl != IPL_NONE)
> > +				mtx->mtx_oldipl = s;
> >  #ifdef DIAGNOSTIC
> > -		ci->ci_mutex_level++;
> > +			ci->ci_mutex_level++;
> >  #endif
> > -		WITNESS_LOCK(MUTEX_LOCK_OBJECT(mtx), LOP_EXCLUSIVE);
> > -		return (1);
> > +			WITNESS_LOCK(MUTEX_LOCK_OBJECT(mtx), LOP_EXCLUSIVE);
> > +			return (1);
> > +		}
> >  	}
> >  
> >  	if (mtx->mtx_wantipl != IPL_NONE)
> > @@ -353,6 +369,7 @@ void
> >  db_mtx_enter(struct db_mutex *mtx)
> >  {
> >  	struct cpu_info *ci = curcpu(), *owner;
> > +	unsigned int i, ncycle = CPU_MIN_BUSY_CYCLES;
> >  	unsigned long s;
> >  
> >  #ifdef DIAGNOSTIC
> > @@ -361,12 +378,22 @@ db_mtx_enter(struct db_mutex *mtx)
> >  #endif
> >  
> >  	s = intr_disable();
> > -
> >  	for (;;) {
> > -		owner = atomic_cas_ptr(&mtx->mtx_owner, NULL, ci);
> > -		if (owner == NULL)
> > -			break;
> > -		CPU_BUSY_CYCLE();
> > +		/*
> > +		 * Avoid unconditional atomic operation to prevent cache
> > +		 * line contention.
> > +		 */
> > +		owner = mtx->mtx_owner;
> > +		if (owner == NULL) {
> > +			owner = atomic_cas_ptr(&mtx->mtx_owner, NULL, ci);
> > +			if (owner == NULL)
> > +				break;
> > +			/* Busy loop with exponential backoff. */
> > +			for (i = ncycle; i > 0; i--)
> > +				CPU_BUSY_CYCLE();
> > +			if (ncycle < CPU_MAX_BUSY_CYCLES)
> > +				ncycle += ncycle;
> > +		}
> >  	}
> >  	membar_enter_after_atomic();
> > 
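
re the MCS locks idea: the trick there is that each waiting cpu spins on
a flag in its own queue node instead of on the shared owner word, so a
handoff only has to touch the next waiter's cache line. a very rough
sketch of the shape (made-up mcs_* names, no memory barriers or IPL
handling, not a proposal):

/*
 * Sketch of an MCS-style queue lock, only to illustrate the idea.
 * Each CPU would need to provide its own mcs_node from somewhere.
 */
struct mcs_node {
	struct mcs_node	* volatile mcs_next;
	volatile int	 mcs_wait;
};

struct mcs_lock {
	struct mcs_node	* volatile mcs_tail;
};

void
mcs_enter(struct mcs_lock *lock, struct mcs_node *self)
{
	struct mcs_node *prev;

	self->mcs_next = NULL;
	self->mcs_wait = 1;

	/* Publish ourselves as the new tail of the queue. */
	prev = atomic_swap_ptr(&lock->mcs_tail, self);
	if (prev == NULL)
		return;		/* lock was free */

	/* Link behind the previous waiter and spin on our own node. */
	prev->mcs_next = self;
	while (self->mcs_wait)
		CPU_BUSY_CYCLE();
}

void
mcs_leave(struct mcs_lock *lock, struct mcs_node *self)
{
	struct mcs_node *next = self->mcs_next;

	if (next == NULL) {
		/* No visible successor: try to swing the tail to empty. */
		if (atomic_cas_ptr(&lock->mcs_tail, self, NULL) == self)
			return;
		/* A successor is linking itself in; wait for it. */
		while ((next = self->mcs_next) == NULL)
			CPU_BUSY_CYCLE();
	}
	next->mcs_wait = 0;	/* hand the lock to the next CPU in line */
}

the awkward part is getting a per-cpu queue node through the existing
mtx_enter()/mtx_leave() API, which is presumably why it's a separate
project.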