pools: limit the number of items that can be kept in per cpu cache lists
On Fri, Jan 30, 2026 at 11:52:45AM +1000, David Gwynne wrote:
> On Thu, Jan 29, 2026 at 01:39:26PM +0100, Claudio Jeker wrote:
> > On Thu, Jan 29, 2026 at 09:40:08PM +1000, David Gwynne wrote:
> > > On Thu, Jan 29, 2026 at 10:58:16AM +0100, Claudio Jeker wrote:
> > > > On Thu, Jan 29, 2026 at 11:42:06AM +1000, David Gwynne wrote:
> > > > > recent events at work have made me conclude that pools are too greedy,
> > > > > and they end up holding onto free items for a lot longer than they
> > > > > should.
> > > > >
> > > > > pools are deliberately conservative about returning memory to the
> > > > > backend page allocators because it's assumed that if the system just
> > > > > used a certain number of these items, it's likely to do the same thing
> > > > > again in the future. so if you allocate a ton of memory out of a pool,
> > > > > the pool will end up holding onto that memory for a while rather than
> > > > > give it straight back to the backend page allocator.
> > > > >
> > > > > this is made worse when you enable per cpu caches on a pool. this
> > > > > effectively adds another 2 layers of free item caching, one of which has
> > > > > a feature that allows for unlimited growth of the free list size.
> > > > >
> > > > > this is the first of a bunch of little steps to try and mitigate these
> > > > > problems.
> > > > >
> > > > > the per cpu caches in pools have a feature where if they detect
> > > > > contention on the global pool, they'll grow the number of items they'll
> > > > > keep in the per cpu cache to mitigate against that contention in the
> > > > > future. the default and minimum list length is 8 items, but there's
> > > > > currently no limit to how long those lists can get.
> > > > >
> > > > > this diff limits the growth of these lists to roughly 64 (71) items. the
> > > > > important part of the change is adding the machinery to enforce the
> > > > > limit, i'm happy to fiddle with 64 in the future. i have a bunch of
> > > > > other changes in this space i want to get in before we do that tuning
> > > > > though.
> > > > >
> > > > > ok?
> > > >
> > > > OK claudio@.
> > > > One comment below (which is unrelated to the diff).
> > > >
> > > > > Index: subr_pool.c
> > > > > ===================================================================
> > > > > RCS file: /cvs/src/sys/kern/subr_pool.c,v
> > > > > diff -u -p -r1.243 subr_pool.c
> > > > > --- subr_pool.c 29 Jan 2026 01:04:35 -0000 1.243
> > > > > +++ subr_pool.c 29 Jan 2026 01:14:17 -0000
> > > > > @@ -155,6 +155,7 @@ struct pool_page_header {
> > > > >
> > > > > #ifdef MULTIPROCESSOR
> > > > > #define POOL_CACHE_LIST_MIN 8 /* minimum list length */
> > > > > +#define POOL_CACHE_LIST_MAX 64
> > > > > #define POOL_CACHE_LIST_INC 8
> > > > > #define POOL_CACHE_LIST_DEC 1
> > > > >
> > > > > @@ -2046,13 +2047,13 @@ pool_cache_gc(struct pool *pp)
> > > > >
> > > > > contention = pp->pr_cache_contention;
> > > > > delta = contention - pp->pr_cache_contention_prev;
> > > > > - if (delta > 8 /* magic */) {
> > > > > - if ((ncpusfound * POOL_CACHE_LIST_MIN * 2) <=
> > > > > - pp->pr_cache_nitems)
> > > > > - pp->pr_cache_items += POOL_CACHE_LIST_INC;
> > > > > - } else if (delta == 0) {
> > > > > - if (pp->pr_cache_items > POOL_CACHE_LIST_MIN)
> > > > > - pp->pr_cache_items -= POOL_CACHE_LIST_DEC;
> > > > > + if (delta > 8 /* magic */ &&
> > > > > + pp->pr_cache_items < POOL_CACHE_LIST_MAX &&
> > > > > + (ncpusfound * POOL_CACHE_LIST_MIN * 2) <= pp->pr_cache_nitems) {
> > > >
> > > > I'm worried about
> > > > (ncpusfound * POOL_CACHE_LIST_MIN * 2) <= pp->pr_cache_nitems
> > > >
> > > > I understand that you don't want to increase the pressure if the cache has
> > > > little churn. My worry is that systems with many cpus are more prone to
> > > > have mtx contention but this check makes it harder for such systems to
> > > > scale up. At least this is how I see the interaction here.
> > > > Maybe you can shed some light on this based on your experience.
> > >
> > > so that check only allows the lists to grow if the pool is already
> > > holding on to enough free items to fill the bigger cpu caches with.
> > > it means it can reduce the cpu/lock contention without taking more
> > > memory from the system than it already has.
> > >
> > > pools don't seem to have a problem accumulating free items. just look at
> > > the idle column in systat pc.
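(To put numbers on that check: with POOL_CACHE_LIST_MIN at 8, an 8 cpu
machine only starts growing its lists once the pool is already holding
8 * 8 * 2 = 128 idle items, and a 64 cpu machine needs 1024.)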
> >
> > true, busy pools also do a lot of puts and those end up in the magazines
> > and depots. At some point the depots get too big for their own good.
>
> i keep having to remind myself that if the idle item count is high,
> it's because the system was actually using all those items at some
> point anyway before it could pool_put them all again.
>
> keeping the idle items around is also fine, and actually a great thing,
> if they're going to be used again in the future.
The problem is when you have an occasional surge event that needs lots of
resources: the depots get very big even though the steady state requires
much less.
I think this is the first step towards solving this issue. After that you
can garbage collect in the depots if pages are sitting there for too long.
That way the number of items should end up closer to the steady state need.
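As an illustration only, such a gc pass over the depot could look roughly
like the sketch below. It assumes each free list pushed into the depot is
stamped with the current tick count and that the depot is protected by
pp->pr_cache_mtx; ci_tick, pr_cache_gc_ticks and the helper functions are
made-up names for the sketch, not code in the tree.

/*
 * hypothetical sketch: drop depot lists that have sat idle for longer
 * than a threshold, oldest first, so the idle item count decays back
 * toward the steady state after a surge.
 */
static void
pool_cache_depot_gc_sketch(struct pool *pp)
{
	struct pool_cache_item *pl;

	mtx_enter(&pp->pr_cache_mtx);
	while ((pl = oldest_depot_list(pp)) != NULL) {	/* made up */
		if (ticks - pl->ci_tick < pp->pr_cache_gc_ticks)
			break;		/* everything else is younger */

		depot_list_remove(pp, pl);		/* made up */
		mtx_leave(&pp->pr_cache_mtx);

		/* pool_put()s every item on the list */
		depot_list_free(pp, pl);		/* made up */

		mtx_enter(&pp->pr_cache_mtx);
	}
	mtx_leave(&pp->pr_cache_mtx);
}

Since pool_cache_gc() already runs periodically to look at the contention
deltas, the aging could presumably piggyback on that.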
> > > you are right that more cpus means more contention though. i am
> > > running this code on my firewall at the moment to try to compensate
> > > for that effect:
> > >
> > > if (delta > ncpus * 8 /* magic */ &&
> > >
> > > i wish i'd made the 8 tunable at runtime, or set it to 4, but i
> > > needed to get the box up and running again as quickly as possible.
> > > regardless, the pools doing the most work on my system still grow,
> > > just nowhere near as much as they used to.
> >
> > So on many cpu machines you don't really want to reduce the contention
> > by adding more cache. Instead, the more cpus you have, the more
> > contention you accept. It feels a bit counterintuitive, but I see your
> > point that adding more and more cache comes at a high price for the
> > rest of the system.
>
> the problem with growing the per cpu caches (magazines) too large
> is that one cpu can't (by design) take memory from the magazine on
> another cpu when there's a memory shortage. it's better to have the
> idle items in the depot so they're accessible to any cpu that needs
> them.
>
> the more cpus you have, the more memory that ends up in caches on other
> cpus. there's this tension between avoiding contention and losing access
> to idle items.
Yes, magazines should be reasonably small and easy to swap. The goal is to
reduce lock contention, so even a small number of rounds in the magazines
will improve your allocation performance. Finding the right size is the
tricky bit, and I guess in this case less is probably more :)
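For what it's worth, the shape of the get path being tuned here is roughly
the sketch below; the helpers local_cache(), list_pop() and depot_refill()
are invented for illustration, and it's assumed the contention counter is
bumped when a trylock on the depot mutex fails.

/*
 * simplified sketch of a per cpu cache get: the common case pops from
 * this cpu's own list with no lock at all, so even a small magazine
 * removes most of the mutex traffic.  the depot mutex is only taken
 * on a refill.
 */
void *
pool_cache_get_sketch(struct pool *pp)
{
	struct pool_cache *pc = local_cache(pp);	/* made up */
	void *item;

	item = list_pop(pc);				/* made up */
	if (item != NULL)
		return (item);		/* fast path, lock free */

	/* the local list ran dry, refill from the shared depot */
	if (mtx_enter_try(&pp->pr_cache_mtx) == 0) {
		mtx_enter(&pp->pr_cache_mtx);
		pp->pr_cache_contention++;	/* read by pool_cache_gc() */
	}
	item = depot_refill(pp, pc);			/* made up */
	mtx_leave(&pp->pr_cache_mtx);

	return (item);
}

In the worst case the mutex is only touched once per pr_cache_items gets,
which is why even a modest list size already buys most of the win.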
> >
> > All makes sense and my OK still holds (just to be clear).
> >
> > > >
> > > >
> > > > > + pp->pr_cache_items += POOL_CACHE_LIST_INC;
> > > > > + } else if (delta == 0 &&
> > > > > + pp->pr_cache_items > POOL_CACHE_LIST_MIN) {
> > > > > + pp->pr_cache_items -= POOL_CACHE_LIST_DEC;
> > > > > }
> > > > > pp->pr_cache_contention_prev = contention;
> > > > > }
> > > > >
> > > > >
> > > >
> > > > --
> > > > :wq Claudio
> > >
> >
> > --
> > :wq Claudio
>
--
:wq Claudio