From: David Gwynne
Subject: Re: wake up mbuf pools when pages get released
To: tech@openbsd.org
Date: Thu, 29 Jan 2026 21:10:00 +1000

On Thu, Jan 29, 2026 at 10:35:05AM +0100, Claudio Jeker wrote:
> On Thu, Jan 29, 2026 at 10:53:17AM +1000, David Gwynne wrote:
> > mbufs are special for lots of reasons, but one is that the total
> > amount of memory that mbufs can be allocated out of is limited by
> > mbuf_mem_limit. all the mbuf and cluster pools are subject to that
> > limit, which is enforced by having these pools use a custom pool
> > page allocator that checks that limit and accounts for their use
> > of it.
> > 
> > the problem is the pools don't coordinate with each other. when
> > mbuf_mem_limit is hit, it's possible for a sleeping allocation to
> > wait on memory in one pool, but when memory is released by another
> > pool that first one doesn't know about it, and doesn't get woken
> > up to try and allocate pages that are now free in the backend page
> > allocator.
> > 
> > the simple fix for this is to wake up the mbuf pools when pages are
> > returned to the backend mbuf page allocator. if any of the pools have
> > pending allocation requests, they are moved forward by the wakeup.
> > 
> > this means if a system does hit the mbuf mem limit and a lot of
> > procs/threads get stuck sleeping on mbuf allocations, there's a
> > better chance they can now be pushed forward if another mbuf pool
> > backs off and gives memory back to the system.
> > 
> > the wakeups are deferred to a task running in the systqmp taskq.
> > this is the same taskq that the pool gc ops run in. if multiple
> > mbuf pools have gc'ed pages released, this debounces the wakeup calls
> > so they only happen once per pool gc run.
> > 
> > i could avoid the wakeup calls by only scheduling the task when the
> > current mbuf_mem_alloc value is close to mbuf_mem_limit, but the pool gc
> > process is an extremely slow path anyway. the ratio of pool_put
> > operations to m_pool_free ops is many millions to one.
> > 
> > im going to commit this in the next day or two unless there are
> > objections. oks are welcome too.
> 
> OK claudio@
> 
> Is the delay introduced by debouncing via the task a problem? Once the
> memory is freed some other pool is able to grab it before the wakeup makes
> it through.

at the moment the only thing that can take the just released memory is
some other pool, so we're not changing that aspect. however, we're adding
the possibility that another cluster pool that's waiting for the memory
gets a chance to take it too.

> Also isn't m_pool_free() already running on the pool gc (and so systqmp)
> taskq?

yes, that's where the debouncing comes from. it is possible that multiple
cluster pools release memory in the same gc run. because the wakeups are
run in the same thread and serialised by it, there's only one wakeup for
all those frees.
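to make the debouncing concrete, here's a minimal standalone C sketch of
the pattern, not the kernel code itself: like the task_add(9) contract,
"adding" a task that is already pending does nothing, so however many
pools give pages back between task runs, the wakeup pass only happens
once. the pool count and the *_model names are made up for the example.

/*
 * illustration only: a userland model of the deferred wakeup.
 * task_add_model() mimics task_add(9): queueing an already-pending
 * task is a no-op, so many frees per gc run still produce a single
 * m_pool_wakeup_model() pass over all the pools.
 */
#include <stdio.h>

#define NPOOLS 4			/* stand-in for mbpool + mclpools */

static int task_pending;		/* stands in for the queued task */
static unsigned int wakeup_runs;	/* how often the "task" ran */

static int
task_add_model(void)
{
	if (task_pending)
		return 0;		/* already queued: debounced */
	task_pending = 1;
	return 1;
}

static void
m_pool_wakeup_model(void)
{
	int i;

	/* one pass wakes every pool that may be sleeping on the limit */
	for (i = 0; i < NPOOLS; i++)
		printf("pool_wakeup(pool %d)\n", i);
	wakeup_runs++;
}

static void
taskq_run_model(void)
{
	if (task_pending) {
		task_pending = 0;
		m_pool_wakeup_model();
	}
}

int
main(void)
{
	int i;

	/* several pools release pages in the same gc run... */
	for (i = 0; i < NPOOLS; i++)
		task_add_model();

	/* ...but the deferred task only runs once for all of them */
	taskq_run_model();

	printf("wakeup passes: %u\n", wakeup_runs);	/* prints 1 */
	return 0;
}

the diff below gets this behaviour for free because task_add() already
refuses to queue a task that is still pending.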
> 
> > Index: uipc_mbuf.c
> > ===================================================================
> > RCS file: /cvs/src/sys/kern/uipc_mbuf.c,v
> > diff -u -p -r1.302 uipc_mbuf.c
> > --- uipc_mbuf.c	6 Aug 2025 14:00:33 -0000	1.302
> > +++ uipc_mbuf.c	29 Jan 2026 00:32:28 -0000
> > @@ -81,6 +81,7 @@
> >  #include
> >  #include
> >  #include
> > +#include <sys/task.h>
> > 
> >  #include
> >  #include
> > @@ -131,6 +132,9 @@ void	m_zero(struct mbuf *);
> >  unsigned long mbuf_mem_limit;	/* [a] how much memory can be allocated */
> >  unsigned long mbuf_mem_alloc;	/* [a] how much memory has been allocated */
> > 
> > +void	m_pool_wakeup(void *);
> > +struct task mbuf_mem_wakeup = TASK_INITIALIZER(m_pool_wakeup, NULL);
> > +
> >  void	*m_pool_alloc(struct pool *, int, int *);
> >  void	m_pool_free(struct pool *, void *);
> > 
> > @@ -212,17 +216,13 @@ mbcpuinit(void)
> >  int
> >  nmbclust_update(long newval)
> >  {
> > -	int i;
> > -
> >  	if (newval <= 0 || newval > LONG_MAX / MCLBYTES)
> >  		return ERANGE;
> >  	/* update the global mbuf memory limit */
> >  	atomic_store_long(&nmbclust, newval);
> >  	atomic_store_long(&mbuf_mem_limit, newval * MCLBYTES);
> > 
> > -	pool_wakeup(&mbpool);
> > -	for (i = 0; i < nitems(mclsizes); i++)
> > -		pool_wakeup(&mclpools[i]);
> > +	task_add(systqmp, &mbuf_mem_wakeup);
> > 
> >  	return 0;
> >  }
> > @@ -1471,6 +1471,18 @@ m_pool_free(struct pool *pp, void *v)
> >  	(*pool_allocator_multi.pa_free)(pp, v);
> > 
> >  	atomic_sub_long(&mbuf_mem_alloc, pp->pr_pgsize);
> > +
> > +	task_add(systqmp, &mbuf_mem_wakeup);
> > +}
> > +
> > +void
> > +m_pool_wakeup(void *null)
> > +{
> > +	int i;
> > +
> > +	pool_wakeup(&mbpool);
> > +	for (i = 0; i < nitems(mclsizes); i++)
> > +		pool_wakeup(&mclpools[i]);
> >  }
> > 
> >  void
> > 
> 
> -- 
> :wq Claudio
> 