From: David Gwynne
Subject: Re: mtx_enter() fix for huge systems
To: Mark Kettenis
Cc: Martin Pieuchot, tech@openbsd.org
Date: Tue, 20 May 2025 08:40:12 +1000

On Mon, May 19, 2025 at 10:55:21PM +0200, Mark Kettenis wrote:
> > Date: Mon, 19 May 2025 16:28:31 +0200
> > From: Martin Pieuchot
> 
> Hi Martin,
> 
> > Last May SCHED_LOCK() was converted to a mutex. This change exposed,
> > on big arm64 machines like the Ampere Altra, a flaw in the MI mutex
> > (spinlock). Then in July the weakness was partially fixed by using
> > ARMv8.1 LSE instructions such as CAS. Sadly the issue is still
> > present and can hang the machine as soon as a given mutex is too
> > contended.
> > 
> > The diff below contains two changes to mitigate cache line contention,
> > and with it my 80-CPU Ampere Altra is now stable:
> > 
> > - Always read the value just before issuing an atomic operation. It
> >   might have changed during splraise() and we want to avoid locking
> >   the cache line. A similar change was made recently in rwlocks.
> 
> I think that's a no-brainer. It is good practice in general and I'm
> frankly surprised we didn't do so already.
> 
> > - Change the busy loop used when waiting for a contended lock. Instead
> >   of wasting a single "busy cycle" in each iteration, increase the
> >   number of cycles exponentially. The min & max numbers for this
> >   exponential backoff have been chosen carefully to not waste more
> >   cycles than in -current. I even observed a small reduction of %sys
> >   time on my amd64 with 24 CPUs.
> 
> This is another good idea. But I'm wondering to what extent using the
> same max number across architectures (and micro-architectures) makes
> sense. On some architectures CPU_BUSY_CYCLE() is a NOP. On others it
> is some sort of hint that may make the CPU stall for a small amount of
> time. But that time will vary significantly between architectures.
> Or it may end up as a NOP anyway. For example, on arm64 we use the
> YIELD instruction, but that instruction might only do something on SMT
> systems, and those don't actually exist in arm64 land.
> 
> Anyway, given the relatively small max value, we're probably ok. Was
> there a huge difference between using 32 or 64 as the max value?

it's better than the current code, and if we figure out an even better
value in the future we can change it then.

> I do measure a small improvement in system time as well on the M2 Pro
> with 10 cores.
> 
> > This diff also removes the MP_LOCKDEBUG goo for mutexes. I believe
> > that nowadays it creates more false positives than useful reports, so
> > I didn't want to adapt that code to the new busy loop. I'd argue we
> > should get rid of the option completely and build something else like
> > lockstat(8).
> 
> Yes, I don't think MP_LOCKDEBUG is very useful anymore. But maybe we
> should keep the diagnostic in __mp_unlock()?
> 
> > As Claudio mentioned earlier, I agree and would welcome an MCS lock
> > implementation for kernel mutexes. That said, the change below should
> > be enough to unstick my uvm_purge() diff and has benefits on its own.
> > 
> > ok?
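
fwiw, for anyone skimming: stripped of the IPL and WITNESS handling, the
shape of the change in the diff below is test-and-test-and-set plus a
capped exponential backoff, something like this sketch (made-up names,
no memory barriers, not the actual code):

/*
 * Sketch only: read before the atomic op, and back off exponentially
 * between attempts. Simplified names, no IPL/WITNESS/membar handling.
 */
#define SPIN_MIN_BUSY_CYCLES	1
#define SPIN_MAX_BUSY_CYCLES	64

void
spin_enter(struct cpu_info * volatile *owner, struct cpu_info *ci)
{
	unsigned int i, ncycle = SPIN_MIN_BUSY_CYCLES;

	for (;;) {
		/*
		 * Only issue the atomic op once a plain read says it
		 * can succeed, so we don't bounce the cache line around.
		 */
		if (*owner == NULL &&
		    atomic_cas_ptr(owner, NULL, ci) == NULL)
			return;

		/* Wait progressively longer before looking again. */
		for (i = ncycle; i > 0; i--)
			CPU_BUSY_CYCLE();
		if (ncycle < SPIN_MAX_BUSY_CYCLES)
			ncycle += ncycle;
	}
}

the point being that the CAS only fires when a plain load says it has a
chance of succeeding, so contended CPUs mostly share the cache line
read-only instead of fighting over exclusive ownership of it.
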
> 
> ok kettenis@

ok

> 
> > Index: kern/kern_lock.c
> > ===================================================================
> > RCS file: /cvs/src/sys/kern/kern_lock.c,v
> > diff -u -p -r1.75 kern_lock.c
> > --- kern/kern_lock.c	3 Jul 2024 01:36:50 -0000	1.75
> > +++ kern/kern_lock.c	19 May 2025 14:19:02 -0000
> > @@ -42,6 +42,21 @@ int __mp_lock_spinout = INT_MAX;
> >  struct __mp_lock kernel_lock;
> >  
> >  /*
> > + * Min & max numbers of "busy cycles" to waste before trying again to
> > + * acquire a contended lock using an atomic operation.
> > + *
> > + * The min number must be as small as possible to not introduce extra
> > + * latency. It also doesn't matter if the first steps of an exponential
> > + * backoff are small.
> > + *
> > + * The max number is used to cap the exponential backoff. It should
> > + * be small enough to not waste too many cycles in %sys time and big
> > + * enough to reduce (ideally avoid) cache line contention.
> > + */
> > +#define CPU_MIN_BUSY_CYCLES	1
> > +#define CPU_MAX_BUSY_CYCLES	64
> > +
> > +/*
> >   * Functions for manipulating the kernel_lock. We put them here
> >   * so that they show up in profiles.
> >   */
> > @@ -228,9 +243,7 @@ void
> >  mtx_enter(struct mutex *mtx)
> >  {
> >  	struct schedstate_percpu *spc = &curcpu()->ci_schedstate;
> > -#ifdef MP_LOCKDEBUG
> > -	int nticks = __mp_lock_spinout;
> > -#endif
> > +	unsigned int i, ncycle = CPU_MIN_BUSY_CYCLES;
> >  
> >  	WITNESS_CHECKORDER(MUTEX_LOCK_OBJECT(mtx),
> >  	    LOP_EXCLUSIVE | LOP_NEWORDER, NULL);
> > @@ -238,15 +251,11 @@ mtx_enter(struct mutex *mtx)
> >  	spc->spc_spinning++;
> >  	while (mtx_enter_try(mtx) == 0) {
> >  		do {
> > -			CPU_BUSY_CYCLE();
> > -#ifdef MP_LOCKDEBUG
> > -			if (--nticks == 0) {
> > -				db_printf("%s: %p lock spun out\n",
> > -				    __func__, mtx);
> > -				db_enter();
> > -				nticks = __mp_lock_spinout;
> > -			}
> > -#endif
> > +			/* Busy loop with exponential backoff. */
> > +			for (i = ncycle; i > 0; i--)
> > +				CPU_BUSY_CYCLE();
> > +			if (ncycle < CPU_MAX_BUSY_CYCLES)
> > +				ncycle += ncycle;
> >  		} while (mtx->mtx_owner != NULL);
> >  	}
> >  	spc->spc_spinning--;
> > @@ -265,20 +274,27 @@ mtx_enter_try(struct mutex *mtx)
> >  	if (mtx->mtx_wantipl != IPL_NONE)
> >  		s = splraise(mtx->mtx_wantipl);
> >  
> > -	owner = atomic_cas_ptr(&mtx->mtx_owner, NULL, ci);
> > +	/*
> > +	 * Avoid unconditional atomic operation to prevent cache line
> > +	 * contention.
> > +	 */
> > +	owner = mtx->mtx_owner;
> >  #ifdef DIAGNOSTIC
> >  	if (__predict_false(owner == ci))
> >  		panic("mtx %p: locking against myself", mtx);
> >  #endif
> >  	if (owner == NULL) {
> > -		membar_enter_after_atomic();
> > -		if (mtx->mtx_wantipl != IPL_NONE)
> > -			mtx->mtx_oldipl = s;
> > +		owner = atomic_cas_ptr(&mtx->mtx_owner, NULL, ci);
> > +		if (owner == NULL) {
> > +			membar_enter_after_atomic();
> > +			if (mtx->mtx_wantipl != IPL_NONE)
> > +				mtx->mtx_oldipl = s;
> >  #ifdef DIAGNOSTIC
> > -		ci->ci_mutex_level++;
> > +			ci->ci_mutex_level++;
> >  #endif
> > -		WITNESS_LOCK(MUTEX_LOCK_OBJECT(mtx), LOP_EXCLUSIVE);
> > -		return (1);
> > +			WITNESS_LOCK(MUTEX_LOCK_OBJECT(mtx), LOP_EXCLUSIVE);
> > +			return (1);
> > +		}
> >  	}
> >  
> >  	if (mtx->mtx_wantipl != IPL_NONE)
> > @@ -353,6 +369,7 @@ void
> >  db_mtx_enter(struct db_mutex *mtx)
> >  {
> >  	struct cpu_info *ci = curcpu(), *owner;
> > +	unsigned int i, ncycle = CPU_MIN_BUSY_CYCLES;
> >  	unsigned long s;
> >  
> >  #ifdef DIAGNOSTIC
> > @@ -361,12 +378,22 @@ db_mtx_enter(struct db_mutex *mtx)
> >  #endif
> >  
> >  	s = intr_disable();
> > -
> >  	for (;;) {
> > -		owner = atomic_cas_ptr(&mtx->mtx_owner, NULL, ci);
> > -		if (owner == NULL)
> > -			break;
> > -		CPU_BUSY_CYCLE();
> > +		/*
> > +		 * Avoid unconditional atomic operation to prevent cache
> > +		 * line contention.
> > +		 */
> > +		owner = mtx->mtx_owner;
> > +		if (owner == NULL) {
> > +			owner = atomic_cas_ptr(&mtx->mtx_owner, NULL, ci);
> > +			if (owner == NULL)
> > +				break;
> > +			/* Busy loop with exponential backoff. */
> > +			for (i = ncycle; i > 0; i--)
> > +				CPU_BUSY_CYCLE();
> > +			if (ncycle < CPU_MAX_BUSY_CYCLES)
> > +				ncycle += ncycle;
> > +		}
> >  	}
> >  	membar_enter_after_atomic();
> > 
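
re the MCS locks idea: the trick there is that each waiting cpu spins on
a flag in its own queue node instead of on the shared owner word, so a
handoff only has to touch the next waiter's cache line. a very rough
sketch of the shape (made-up mcs_* names, no memory barriers or IPL
handling, not a proposal):

/*
 * Sketch of an MCS-style queue lock, only to illustrate the idea.
 * Each CPU would need to provide its own mcs_node from somewhere.
 */
struct mcs_node {
	struct mcs_node	* volatile mcs_next;
	volatile int	 mcs_wait;
};

struct mcs_lock {
	struct mcs_node	* volatile mcs_tail;
};

void
mcs_enter(struct mcs_lock *lock, struct mcs_node *self)
{
	struct mcs_node *prev;

	self->mcs_next = NULL;
	self->mcs_wait = 1;

	/* Publish ourselves as the new tail of the queue. */
	prev = atomic_swap_ptr(&lock->mcs_tail, self);
	if (prev == NULL)
		return;		/* lock was free */

	/* Link behind the previous waiter and spin on our own node. */
	prev->mcs_next = self;
	while (self->mcs_wait)
		CPU_BUSY_CYCLE();
}

void
mcs_leave(struct mcs_lock *lock, struct mcs_node *self)
{
	struct mcs_node *next = self->mcs_next;

	if (next == NULL) {
		/* No visible successor: try to swing the tail to empty. */
		if (atomic_cas_ptr(&lock->mcs_tail, self, NULL) == self)
			return;
		/* A successor is linking itself in; wait for it. */
		while ((next = self->mcs_next) == NULL)
			CPU_BUSY_CYCLE();
	}
	next->mcs_wait = 0;	/* hand the lock to the next CPU in line */
}

the awkward part is getting a per-cpu queue node through the existing
mtx_enter()/mtx_leave() API, which is presumably why it's a separate
project.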