From: David Gwynne Subject: Re: wake up mbuf pools when pages get released To: tech@openbsd.org Date: Fri, 30 Jan 2026 11:40:54 +1000 On Thu, Jan 29, 2026 at 01:45:06PM +0100, Claudio Jeker wrote: > On Thu, Jan 29, 2026 at 09:10:00PM +1000, David Gwynne wrote: > > On Thu, Jan 29, 2026 at 10:35:05AM +0100, Claudio Jeker wrote: > > > On Thu, Jan 29, 2026 at 10:53:17AM +1000, David Gwynne wrote: > > > > mbufs are special for lots of reasons, but one is that the total > > > > amount of memory that mbufs can be allocated out of is limited by > > > > mbuf_mem_limit. all the mbuf and cluster pools are subject to that > > > > limit, which is enforced by having these pools use a custom pool > > > > page allocator that checks that limit and accounts for their use > > > > of it. > > > > > > > > the problem is the pools don't coordinate with each other. when > > > > mbuf_mem_limit is hit, it's possible for a sleeping allocation to > > > > wait on memory in one pool, but when memory is released by another > > > > pool that first one doesn't know about it, and doesn't get woken > > > > up to try and allocate pages that are now free in the backend page > > > > allocator. > > > > > > > > the simple fix for this is to wakeup the mbuf pools when pages are > > > > returned to the backend mbuf page allocator. if any of the pools have > > > > pending allocation requests, they are moved forward by the wakeup. > > > > > > > > this means if a system does hit the mbuf mem limit and a lot of > > > > procs/threads get stuck sleeping on mbuf allocations, there's a > > > > better chance they can now be pushed forward if another mbuf pool > > > > backs off and gives memory back to the system. > > > > > > > > the wakeups are deferred to a task running in the systqmp taskq. > > > > this is the same taskq that the pool gc ops run in. if multiple > > > > mbuf pools have gced pages released, this debounces the wakeup calls > > > > so they only happen once per pool gc run. > > > > > > > > i could avoid the wakeup calls by only scheduling the task when the > > > > current mbuf_mem_alloc value is close to mbuf_mem_limit, but the pool gc > > > > process is the extremely slow path anyway. the ratio of pool_put > > > > operations to m_pool_free ops is many millions to one. > > > > > > > > im going to commit this in the next day or two unless there are > > > > objections. oks are welcome too. > > > > > > OK claudio@ > > > > > > Is the delay introduced deboucing via the task a problem? Once the > > > memory is freed some other pool is able to grab it before the wakeup makes > > > it through. > > > > at the moment the only thing that can take the just released memory is > > some other pool, so we're not changing that aspect. however, we're > > adding the possibility that another cluster pool that's waiting for the > > memory gets a change to take it too. > > I'm currently just worried about these wakeup games because of the > pagedaemon and how all the sleep/wakeup work (or don't work) there. > Delaying wakeups often causes work backlogs to build up which in the worst > case take the system down in a doom spiral. i understand, but again, you should focus on the fact that this adds wakeups that just don't exist at the moment. > > > Also isn't m_pool_free() already running on the pool gc (and so systqmp) > > > taskq? > > > > yes, that's where the debouncing comes from. it is possible that > > multiple cluster pools release memory in the same gc run. because the > > wakeups are run in the same thread and serialised by it, there's only > > one wakeup for all those frees. > > But you could just call wakeup knowing that only the first one will > actually wakeup a sleeping pool. Calling wakeup on a ident that is not > asleep should be super cheap (not sure if it is right now). two things. firstly, a single pool_gc run can return multiple pages. it doesnt make sense to me to wake everything up after every free when we have an opportunity to free a bunch of stuff and then do the wake up once. secondly, pool_wakeup() doesnt blindly call wakeup(). pools keep a list of pending allocation requests. if that list is empty then it returns immediately, which is cheap. a real wakeup() call has to iterate through a hash bucket while holding sched lock to see if it has anything to do. > > > > > > > > Index: uipc_mbuf.c > > > > =================================================================== > > > > RCS file: /cvs/src/sys/kern/uipc_mbuf.c,v > > > > diff -u -p -r1.302 uipc_mbuf.c > > > > --- uipc_mbuf.c 6 Aug 2025 14:00:33 -0000 1.302 > > > > +++ uipc_mbuf.c 29 Jan 2026 00:32:28 -0000 > > > > @@ -81,6 +81,7 @@ > > > > #include > > > > #include > > > > #include > > > > +#include > > > > > > > > #include > > > > #include > > > > @@ -131,6 +132,9 @@ void m_zero(struct mbuf *); > > > > unsigned long mbuf_mem_limit; /* [a] how much memory can be allocated */ > > > > unsigned long mbuf_mem_alloc; /* [a] how much memory has been allocated */ > > > > > > > > +void m_pool_wakeup(void *); > > > > +struct task mbuf_mem_wakeup = TASK_INITIALIZER(m_pool_wakeup, NULL); > > > > + > > > > void *m_pool_alloc(struct pool *, int, int *); > > > > void m_pool_free(struct pool *, void *); > > > > > > > > @@ -212,17 +216,13 @@ mbcpuinit(void) > > > > int > > > > nmbclust_update(long newval) > > > > { > > > > - int i; > > > > - > > > > if (newval <= 0 || newval > LONG_MAX / MCLBYTES) > > > > return ERANGE; > > > > /* update the global mbuf memory limit */ > > > > atomic_store_long(&nmbclust, newval); > > > > atomic_store_long(&mbuf_mem_limit, newval * MCLBYTES); > > > > > > > > - pool_wakeup(&mbpool); > > > > - for (i = 0; i < nitems(mclsizes); i++) > > > > - pool_wakeup(&mclpools[i]); > > > > + task_add(systqmp, &mbuf_mem_wakeup); > > > > > > > > return 0; > > > > } > > > > @@ -1471,6 +1471,18 @@ m_pool_free(struct pool *pp, void *v) > > > > (*pool_allocator_multi.pa_free)(pp, v); > > > > > > > > atomic_sub_long(&mbuf_mem_alloc, pp->pr_pgsize); > > > > + > > > > + task_add(systqmp, &mbuf_mem_wakeup); > > > > +} > > > > + > > > > +void > > > > +m_pool_wakeup(void *null) > > > > +{ > > > > + int i; > > > > + > > > > + pool_wakeup(&mbpool); > > > > + for (i = 0; i < nitems(mclsizes); i++) > > > > + pool_wakeup(&mclpools[i]); > > > > } > > > > > > > > void > > > > > > > > > > -- > > > :wq Claudio > > > > > > > -- > :wq Claudio