mtx_enter() fix for huge systems
> Date: Mon, 19 May 2025 16:28:31 +0200
> From: Martin Pieuchot <mpi@grenadille.net>
Hi Martin,
> Last May the SCHED_LOCK() has been converted to a mutex. This change
> exposed, on big arm64 machines like the Ampere Altra, a flaw in the MI
> mutex (spinlock). Then in July the weakness got partially fixed by
> using ARMv8.1 LSE instructions such as CAS. Sadly the issue is still
> present and can hang the machine as soon as a given mutex is too
> contended.
>
> Diff below contains two changes to mitigate cache line contention and
> with them my 80-CPU Ampere Altra is now stable:
>
> - Always read the value just before issuing an atomic operation. It
> might have changed during splraise() and we want to avoid locking
> the cache line. A similar change has been made recently in rwlocks.
I think that's a no-brainer. It is good practice in general and I'm
frankly surprised we didn't do so already.
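For illustration, the pattern boils down to something like the following
userland C11 sketch (the names here are made up for the example and are
not the kernel API):

#include <stdatomic.h>
#include <stdbool.h>

static _Atomic(void *) lock_owner;

static bool
try_lock(void *self)
{
	/*
	 * Plain load first: a shared read leaves the cache line in a
	 * shared state instead of pulling it exclusive on every attempt.
	 */
	if (atomic_load_explicit(&lock_owner, memory_order_relaxed) != NULL)
		return false;

	/* Only now issue the expensive atomic compare-and-swap. */
	void *expected = NULL;
	return atomic_compare_exchange_strong_explicit(&lock_owner,
	    &expected, self, memory_order_acquire, memory_order_relaxed);
}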
> - Change the busy loop when waiting for a contended lock. Instead of
> wasting a single "busy cycle" in each iteration, increase the number
> of cycles exponentially. The min & max numbers for this exponential
> backoff have been chosen carefully to not waste more cycles than in
> -current. I even observed a small reduction of %sys time on my amd64
> with 24 CPUs.
This is another good idea. But I'm wondering to what extent using the
same max number across architectures (and micro-architectures) makes
sense. On some architectures CPU_BUSY_CYCLE() is a NOP. On others it
is some sort of hint that may make the CPU stall for a small amount of
time. But that time will vary significantly between architectures.
Or it may end up as a NOP anyway. For example on arm64 we use the
YIELD instruction, but that instruction might only do something on SMT
systems. And those don't actually exist in arm64 land.
Anyway, given the relatively small max value, we're probably ok. Was
there a huge difference between using 32 or 64 as the max value?
I do measure a small improvement in system time as well on the M2 Pro
with 10 cores.
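To make the cost model concrete, here is a rough userland sketch of the
capped doubling from the diff, with a stand-in cpu_relax() for
CPU_BUSY_CYCLE(); the per-architecture asm is only illustrative (PAUSE
on amd64, YIELD on arm64 as mentioned above, otherwise just a compiler
barrier), not the kernel macro itself:

#include <stdatomic.h>

#define MIN_BUSY_CYCLES		1
#define MAX_BUSY_CYCLES		64

static inline void
cpu_relax(void)
{
#if defined(__x86_64__)
	__asm volatile("pause" ::: "memory");
#elif defined(__aarch64__)
	__asm volatile("yield" ::: "memory");
#else
	__asm volatile("" ::: "memory");	/* compiler barrier only */
#endif
}

static void
spin_backoff_wait(_Atomic(void *) *owner)
{
	unsigned int i, ncycle = MIN_BUSY_CYCLES;

	do {
		/* Busy loop with exponential backoff, as in the diff. */
		for (i = ncycle; i > 0; i--)
			cpu_relax();
		if (ncycle < MAX_BUSY_CYCLES)
			ncycle += ncycle;	/* double, capped at the max */
	} while (atomic_load_explicit(owner, memory_order_relaxed) != NULL);
}

With a cap of 64 the loop only ever wastes a bounded number of relax
hints per retry, which matches the "not waste more cycles than in
-current" goal stated above.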
> This diff also removes the MP_LOCKDEBUG goo for mutexes. I believe that
> nowadays it creates more false positives than it is useful. So I didn't
> want to adapt the code to the new busy loop. I'd argue that we should
> get rid of the option completely and build something else like lockstat(8).
Yes, I don't think MP_LOCKDEBUG is very useful anymore. But maybe we
should keep the diagnostic in __mp_unlock()?
> As Claudio mentioned earlier, I agree and I would welcome an MCS lock
> implementation for kernel mutexes. That said, the change below should
> be enough to unstick my uvm_purge() diff and has benefits of its own.
>
> ok?
ok kettenis@
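Since MCS locks came up: the idea is that each waiter spins on its own
queue node instead of every CPU hammering the same cache line. A very
small userland C11 sketch, illustrative only and not a proposed kernel
API:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
	_Atomic(struct mcs_node *)	next;
	atomic_bool			locked;
};

struct mcs_lock {
	_Atomic(struct mcs_node *)	tail;
};

static void
mcs_enter(struct mcs_lock *l, struct mcs_node *n)
{
	struct mcs_node *prev;

	atomic_store_explicit(&n->next, NULL, memory_order_relaxed);
	atomic_store_explicit(&n->locked, true, memory_order_relaxed);

	/* Join the queue; the previous tail, if any, is our predecessor. */
	prev = atomic_exchange_explicit(&l->tail, n, memory_order_acq_rel);
	if (prev == NULL)
		return;		/* lock was free, we own it */

	atomic_store_explicit(&prev->next, n, memory_order_release);
	/* Spin on our own node only: no shared cache line contention. */
	while (atomic_load_explicit(&n->locked, memory_order_acquire))
		;
}

static void
mcs_leave(struct mcs_lock *l, struct mcs_node *n)
{
	struct mcs_node *succ, *expected = n;

	succ = atomic_load_explicit(&n->next, memory_order_acquire);
	if (succ == NULL) {
		/* No known successor: try to reset the queue to empty. */
		if (atomic_compare_exchange_strong_explicit(&l->tail,
		    &expected, NULL, memory_order_acq_rel,
		    memory_order_relaxed))
			return;
		/* A successor is enqueueing; wait until it links itself. */
		while ((succ = atomic_load_explicit(&n->next,
		    memory_order_acquire)) == NULL)
			;
	}
	atomic_store_explicit(&succ->locked, false, memory_order_release);
}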
> Index: kern/kern_lock.c
> ===================================================================
> RCS file: /cvs/src/sys/kern/kern_lock.c,v
> diff -u -p -r1.75 kern_lock.c
> --- kern/kern_lock.c 3 Jul 2024 01:36:50 -0000 1.75
> +++ kern/kern_lock.c 19 May 2025 14:19:02 -0000
> @@ -42,6 +42,21 @@ int __mp_lock_spinout = INT_MAX;
> struct __mp_lock kernel_lock;
>
> /*
> + * Min & max numbers of "busy cycles" to waste before trying again to
> + * acquire a contended lock using an atomic operation.
> + *
> + * The min number must be as small as possible to not introduce extra
> + * latency. It also doesn't matter if the first steps of an exponential
> + * backoff are small.
> + *
> + * The max number is used to cap the exponential backoff. It should
> + * be small enough to not waste too many cycles in %sys time and big
> + * enough to reduce (ideally avoid) cache line contention.
> + */
> +#define CPU_MIN_BUSY_CYCLES 1
> +#define CPU_MAX_BUSY_CYCLES 64
> +
> +/*
> * Functions for manipulating the kernel_lock. We put them here
> * so that they show up in profiles.
> */
> @@ -228,9 +243,7 @@ void
> mtx_enter(struct mutex *mtx)
> {
> struct schedstate_percpu *spc = &curcpu()->ci_schedstate;
> -#ifdef MP_LOCKDEBUG
> - int nticks = __mp_lock_spinout;
> -#endif
> + unsigned int i, ncycle = CPU_MIN_BUSY_CYCLES;
>
> WITNESS_CHECKORDER(MUTEX_LOCK_OBJECT(mtx),
> LOP_EXCLUSIVE | LOP_NEWORDER, NULL);
> @@ -238,15 +251,11 @@ mtx_enter(struct mutex *mtx)
> spc->spc_spinning++;
> while (mtx_enter_try(mtx) == 0) {
> do {
> - CPU_BUSY_CYCLE();
> -#ifdef MP_LOCKDEBUG
> - if (--nticks == 0) {
> - db_printf("%s: %p lock spun out\n",
> - __func__, mtx);
> - db_enter();
> - nticks = __mp_lock_spinout;
> - }
> -#endif
> + /* Busy loop with exponential backoff. */
> + for (i = ncycle; i > 0; i--)
> + CPU_BUSY_CYCLE();
> + if (ncycle < CPU_MAX_BUSY_CYCLES)
> + ncycle += ncycle;
> } while (mtx->mtx_owner != NULL);
> }
> spc->spc_spinning--;
> @@ -265,20 +274,27 @@ mtx_enter_try(struct mutex *mtx)
> if (mtx->mtx_wantipl != IPL_NONE)
> s = splraise(mtx->mtx_wantipl);
>
> - owner = atomic_cas_ptr(&mtx->mtx_owner, NULL, ci);
> + /*
> + * Avoid unconditional atomic operation to prevent cache line
> + * contention.
> + */
> + owner = mtx->mtx_owner;
> #ifdef DIAGNOSTIC
> if (__predict_false(owner == ci))
> panic("mtx %p: locking against myself", mtx);
> #endif
> if (owner == NULL) {
> - membar_enter_after_atomic();
> - if (mtx->mtx_wantipl != IPL_NONE)
> - mtx->mtx_oldipl = s;
> + owner = atomic_cas_ptr(&mtx->mtx_owner, NULL, ci);
> + if (owner == NULL) {
> + membar_enter_after_atomic();
> + if (mtx->mtx_wantipl != IPL_NONE)
> + mtx->mtx_oldipl = s;
> #ifdef DIAGNOSTIC
> - ci->ci_mutex_level++;
> + ci->ci_mutex_level++;
> #endif
> - WITNESS_LOCK(MUTEX_LOCK_OBJECT(mtx), LOP_EXCLUSIVE);
> - return (1);
> + WITNESS_LOCK(MUTEX_LOCK_OBJECT(mtx), LOP_EXCLUSIVE);
> + return (1);
> + }
> }
>
> if (mtx->mtx_wantipl != IPL_NONE)
> @@ -353,6 +369,7 @@ void
> db_mtx_enter(struct db_mutex *mtx)
> {
> struct cpu_info *ci = curcpu(), *owner;
> + unsigned int i, ncycle = CPU_MIN_BUSY_CYCLES;
> unsigned long s;
>
> #ifdef DIAGNOSTIC
> @@ -361,12 +378,22 @@ db_mtx_enter(struct db_mutex *mtx)
> #endif
>
> s = intr_disable();
> -
> for (;;) {
> - owner = atomic_cas_ptr(&mtx->mtx_owner, NULL, ci);
> - if (owner == NULL)
> - break;
> - CPU_BUSY_CYCLE();
> + /*
> + * Avoid unconditional atomic operation to prevent cache
> + * line contention.
> + */
> + owner = mtx->mtx_owner;
> + if (owner == NULL) {
> + owner = atomic_cas_ptr(&mtx->mtx_owner, NULL, ci);
> + if (owner == NULL)
> + break;
> + /* Busy loop with exponential backoff. */
> + for (i = ncycle; i > 0; i--)
> + CPU_BUSY_CYCLE();
> + if (ncycle < CPU_MAX_BUSY_CYCLES)
> + ncycle += ncycle;
> + }
> }
> membar_enter_after_atomic();
>