Index | Thread | Search

From:
Claudio Jeker <cjeker@diehard.n-r-g.com>
Subject:
Re: wake up mbuf pools when pages get released
To:
David Gwynne <david@gwynne.id.au>
Cc:
tech@openbsd.org
Date:
Fri, 30 Jan 2026 06:17:22 +0100

Download raw body.

Thread
On Fri, Jan 30, 2026 at 11:40:54AM +1000, David Gwynne wrote:
> On Thu, Jan 29, 2026 at 01:45:06PM +0100, Claudio Jeker wrote:
> > On Thu, Jan 29, 2026 at 09:10:00PM +1000, David Gwynne wrote:
> > > On Thu, Jan 29, 2026 at 10:35:05AM +0100, Claudio Jeker wrote:
> > > > On Thu, Jan 29, 2026 at 10:53:17AM +1000, David Gwynne wrote:
> > > > > mbufs are special for lots of reasons, but one is that the total
> > > > > amount of memory that mbufs can be allocated out of is limited by
> > > > > mbuf_mem_limit. all the mbuf and cluster pools are subject to that
> > > > > limit, which is enforced by having these pools use a custom pool
> > > > > page allocator that checks that limit and accounts for their use
> > > > > of it.
> > > > > 
> > > > > the problem is the pools don't coordinate with each other. when
> > > > > mbuf_mem_limit is hit, it's possible for a sleeping allocation to
> > > > > wait on memory in one pool, but when memory is released by another
> > > > > pool that first one doesn't know about it, and doesn't get woken
> > > > > up to try and allocate pages that are now free in the backend page
> > > > > allocator.
> > > > > 
> > > > > the simple fix for this is to wakeup the mbuf pools when pages are
> > > > > returned to the backend mbuf page allocator. if any of the pools have
> > > > > pending allocation requests, they are moved forward by the wakeup.
> > > > > 
> > > > > this means if a system does hit the mbuf mem limit and a lot of
> > > > > procs/threads get stuck sleeping on mbuf allocations, there's a
> > > > > better chance they can now be pushed forward if another mbuf pool
> > > > > backs off and gives memory back to the system.
> > > > > 
> > > > > the wakeups are deferred to a task running in the systqmp taskq.
> > > > > this is the same taskq that the pool gc ops run in. if multiple
> > > > > mbuf pools have gced pages released, this debounces the wakeup calls
> > > > > so they only happen once per pool gc run.
> > > > > 
> > > > > i could avoid the wakeup calls by only scheduling the task when the
> > > > > current mbuf_mem_alloc value is close to mbuf_mem_limit, but the pool gc
> > > > > process is the extremely slow path anyway. the ratio of pool_put
> > > > > operations to m_pool_free ops is many millions to one.
> > > > > 
> > > > > im going to commit this in the next day or two unless there are
> > > > > objections. oks are welcome too.
> > > > 
> > > > OK claudio@
> > > > 
> > > > Is the delay introduced deboucing via the task a problem? Once the
> > > > memory is freed some other pool is able to grab it before the wakeup makes
> > > > it through.
> > > 
> > > at the moment the only thing that can take the just released memory is
> > > some other pool, so we're not changing that aspect. however, we're
> > > adding the possibility that another cluster pool that's waiting for the
> > > memory gets a change to take it too.
> > 
> > I'm currently just worried about these wakeup games because of the
> > pagedaemon and how all the sleep/wakeup work (or don't work) there.
> > Delaying wakeups often causes work backlogs to build up which in the worst
> > case take the system down in a doom spiral.
> 
> i understand, but again, you should focus on the fact that this adds
> wakeups that just don't exist at the moment.

I know, I just try to think ahead on what the next tripping point will be.
This is why I brought it up. I agree with all you said.
 
> > > > Also isn't m_pool_free() already running on the pool gc (and so systqmp)
> > > > taskq?
> > > 
> > > yes, that's where the debouncing comes from. it is possible that
> > > multiple cluster pools release memory in the same gc run. because the
> > > wakeups are run in the same thread and serialised by it, there's only
> > > one wakeup for all those frees.
> > 
> > But you could just call wakeup knowing that only the first one will
> > actually wakeup a sleeping pool. Calling wakeup on a ident that is not
> > asleep should be super cheap (not sure if it is right now).
> 
> two things. 
> 
> firstly, a single pool_gc run can return multiple pages. it doesnt
> make sense to me to wake everything up after every free when we
> have an opportunity to free a bunch of stuff and then do the wake
> up once.

Yes, don't do wakeups per page that would be silly.
 
> secondly, pool_wakeup() doesnt blindly call wakeup(). pools keep a list
> of pending allocation requests. if that list is empty then it returns
> immediately, which is cheap.
> 
> a real wakeup() call has to iterate through a hash bucket while holding
> sched lock to see if it has anything to do.

Indeed, this is why I said "should be super cheap". Right now I think it
is not as cheap as it should.
 
> >  
> > > > 
> > > > > Index: uipc_mbuf.c
> > > > > ===================================================================
> > > > > RCS file: /cvs/src/sys/kern/uipc_mbuf.c,v
> > > > > diff -u -p -r1.302 uipc_mbuf.c
> > > > > --- uipc_mbuf.c	6 Aug 2025 14:00:33 -0000	1.302
> > > > > +++ uipc_mbuf.c	29 Jan 2026 00:32:28 -0000
> > > > > @@ -81,6 +81,7 @@
> > > > >  #include <sys/pool.h>
> > > > >  #include <sys/percpu.h>
> > > > >  #include <sys/sysctl.h>
> > > > > +#include <sys/task.h>
> > > > >  
> > > > >  #include <sys/socket.h>
> > > > >  #include <net/if.h>
> > > > > @@ -131,6 +132,9 @@ void	m_zero(struct mbuf *);
> > > > >  unsigned long mbuf_mem_limit;	/* [a] how much memory can be allocated */
> > > > >  unsigned long mbuf_mem_alloc;	/* [a] how much memory has been allocated */
> > > > >  
> > > > > +void	m_pool_wakeup(void *);
> > > > > +struct task mbuf_mem_wakeup = TASK_INITIALIZER(m_pool_wakeup, NULL);
> > > > > +
> > > > >  void	*m_pool_alloc(struct pool *, int, int *);
> > > > >  void	m_pool_free(struct pool *, void *);
> > > > >  
> > > > > @@ -212,17 +216,13 @@ mbcpuinit(void)
> > > > >  int
> > > > >  nmbclust_update(long newval)
> > > > >  {
> > > > > -	int i;
> > > > > -
> > > > >  	if (newval <= 0 || newval > LONG_MAX / MCLBYTES)
> > > > >  		return ERANGE;
> > > > >  	/* update the global mbuf memory limit */
> > > > >  	atomic_store_long(&nmbclust, newval);
> > > > >  	atomic_store_long(&mbuf_mem_limit, newval * MCLBYTES);
> > > > >  
> > > > > -	pool_wakeup(&mbpool);
> > > > > -	for (i = 0; i < nitems(mclsizes); i++)
> > > > > -		pool_wakeup(&mclpools[i]);
> > > > > +	task_add(systqmp, &mbuf_mem_wakeup);
> > > > >  
> > > > >  	return 0;
> > > > >  }
> > > > > @@ -1471,6 +1471,18 @@ m_pool_free(struct pool *pp, void *v)
> > > > >  	(*pool_allocator_multi.pa_free)(pp, v);
> > > > >  
> > > > >  	atomic_sub_long(&mbuf_mem_alloc, pp->pr_pgsize);
> > > > > +
> > > > > +	task_add(systqmp, &mbuf_mem_wakeup);
> > > > > +}
> > > > > +
> > > > > +void
> > > > > +m_pool_wakeup(void *null)
> > > > > +{
> > > > > +	int i;
> > > > > +
> > > > > +	pool_wakeup(&mbpool);
> > > > > +	for (i = 0; i < nitems(mclsizes); i++)
> > > > > +		pool_wakeup(&mclpools[i]);
> > > > >  }
> > > > >  
> > > > >  void
> > > > > 
> > > > 
> > > > -- 
> > > > :wq Claudio
> > > > 
> > > 
> > 
> > -- 
> > :wq Claudio
> 

-- 
:wq Claudio