From: Alexander Bluhm <bluhm@openbsd.org>
Subject: Re: mbuf cluster m_extref_mtx contention
To: David Gwynne <david@gwynne.id.au>
Cc: tech@openbsd.org, Claudio Jeker <claudio@openbsd.org>
Date: Mon, 16 Mar 2026 16:55:49 +0100

On Sat, Mar 14, 2026 at 10:25:33AM +1000, David Gwynne wrote:
> On Wed, Feb 25, 2026 at 11:02:59PM +0100, Alexander Bluhm wrote:
> > Hi David,
> > 
> > I have tested your diff again and throughput in my splicing test
> > goes from 35 GBit/sec to 65 GBit/sec.  Could you please commit it,
> > so I can work on the next bottleneck.
> > 
> > OK bluhm@
> 
> i'm much more comfortable with the proxy version instead of the
> refcount on a leader version.
> 
> the leader version adds a constraint where the leader can't detach
> a cluster until all references are dropped. the consequence of this
> is immediately demonstrated in m_defrag, and it required auditing
> the rest of the kernel to make sure that nowhere else removed or
> replaced a cluster on an mbuf. it feels like a big footgun that's
> going to hurt someone in the future.
> 
> delaying the actual free of the leader mbuf is ugly too.
> 
> i'd prefer to give up a little bit of performance in exchange for
> less dangerous and ugly code.

I see it differently.  Additional memory allocations make it harder
to understand what is going on in the whole system.  And they may
be slower.

The special hacks in m_defrag() that swap the cluster on an existing
mbuf look ugly to me.  I was happy when I saw your diff that just
chained the new mbuf.

Part of the mbuf design is to use the struct mbuf_ext for cluster
management.  Doing the refcounting there would be the natural thing
in the mbuf world.  And only the free of the small mbuf is delayed,
not the free of the cluster.
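
To illustrate the idea, here is a rough userland sketch.  It is not
the kernel diff: the toy_* names, the struct layout and the two
counters are made up for this mail, and it assumes a plain C11 atomic
counter stored in the leader.  Sharing a cluster only bumps that
counter, so there is no proxy allocation, and a leader that is freed
early stays allocated until the last follower drops its reference.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Toy userland model, not kernel code.  All names are hypothetical. */
struct toy_mbuf {
	struct toy_mbuf	*ext_leader;	/* mbuf that holds the counter */
	char		*ext_buf;	/* shared cluster */
	atomic_uint	 ext_refcnt;	/* only meaningful in the leader */
};

/* Demo counters, only there to make the free order observable. */
static int toy_mbufs_freed, toy_clusters_freed;

/* Allocate an mbuf with a cluster; it becomes its own leader. */
struct toy_mbuf *
toy_clget(size_t len)
{
	struct toy_mbuf *m = calloc(1, sizeof(*m));

	if (m == NULL)
		return (NULL);
	if ((m->ext_buf = malloc(len)) == NULL) {
		free(m);
		return (NULL);
	}
	m->ext_leader = m;
	atomic_init(&m->ext_refcnt, 1);
	return (m);
}

/* Share the cluster of m with a new mbuf: one atomic increment. */
struct toy_mbuf *
toy_extref(struct toy_mbuf *m)
{
	struct toy_mbuf *leader = m->ext_leader;
	struct toy_mbuf *n = calloc(1, sizeof(*n));

	if (n == NULL)
		return (NULL);
	n->ext_leader = leader;
	n->ext_buf = leader->ext_buf;
	atomic_fetch_add(&leader->ext_refcnt, 1);
	return (n);
}

/* Drop one reference; the last one frees the cluster and the leader. */
void
toy_free(struct toy_mbuf *m)
{
	struct toy_mbuf *leader = m->ext_leader;

	if (atomic_fetch_sub(&leader->ext_refcnt, 1) == 1) {
		free(leader->ext_buf);
		toy_clusters_freed++;
		if (m != leader) {
			/* the leader was released earlier, free it now */
			free(leader);
			toy_mbufs_freed++;
		}
		free(m);
		toy_mbufs_freed++;
		return;
	}
	if (m != leader) {
		free(m);
		toy_mbufs_freed++;
	}
	/* else: the leader stays allocated until the last follower goes */
}
```

In this model, freeing the leader while a follower still exists frees
nothing yet.  Freeing the follower afterwards releases the cluster and
both small mbufs.  The cluster lives exactly as long as something
references it; only the free of the leader mbuf is postponed.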

We need a third opinion.  Claudio, what do you think?

bluhm