From: Martin Pieuchot
Subject: Re: per-CPU page caches for page faults
To: Mateusz Guzik
Cc: tech@openbsd.org
Date: Sun, 24 Mar 2024 11:52:30 +0100

On 21/03/24(Thu) 11:52, Mateusz Guzik wrote:
> On Mon, Mar 18, 2024 at 07:13:43PM +0000, Martin Pieuchot wrote:
> > What is the idea behind this diff?  With a substantial number of
> > CPUs (16 or more), grabbing a global mutex for every page allocation
> > & free creates a lot of contention, resulting in many CPU cycles
> > wasted in system (kernel) time.  The idea of this diff is to add
> > another layer on top of the global allocator to allocate and free
> > pages in batches.  Note that, in this diff, this cache is only used
> > for page faults.
> >
> > +/*
> > + * uvm_pcpu_getpage: allocate a page from the current CPU's cache
> > + */
> > +struct vm_page *
> > +uvm_pcpu_getpage(int flags)
> > +{
> > +	struct uvm_percpu *upc = &curcpu()->ci_uvm;
> > +	struct vm_page *pg;
> > +
> > +	if (upc->upc_count == 0) {
> > +		atomic_inc_int(&uvmexp.pcpmiss);
> > +		if (uvm_pcpu_fillcache())
> > +			return NULL;
> > +	} else {
> > +		atomic_inc_int(&uvmexp.pcphit);
> > +	}
> > +
> > +	atomic_dec_int(&uvmexp.percpucaches);
> > +	upc->upc_count--;
> > +	pg = upc->upc_pages[upc->upc_count];
> > +
> > +	if (flags & UVM_PLA_ZERO)
> > +		uvm_pagezero(pg);
> > +
> > +	return pg;
> > +}
> > +
>
> First 2 minor remarks:
>
> 1. maintaining stats in a global struct avoidably reduces
> single-threaded perf due to atomics and scalability due to cacheline
> bounces. the hand-rolled mechanism should have its own cpu-local stats.

I measured no difference when using per-CPU counters, and since the
current API is incomplete I opted for the simpler version.  So I'd be
very happy to see a diff from you or anybody else that takes care of
all the (evil) details of using per-CPU counters once this is in.  A
sketch of what such cpu-local stats could look like follows at the end
of this mail.

> 2. uvm_pagezero on amd64 is implemented with non-temporal stores because
> of the dedicated kernel thread for background page zeroing. while I
> consider existence of such a thread to be long obsolete, I'm going to
> ignore this aspect. key here is that nt stores before direct usage of
> the page only results in cache misses later on (even more so if you have
> a LIFO allocation policy and the page was mostly in L3). instead
> uvm_pagezero_onfault or something could be added to zero "normally".

I like your idea a lot; however, it is completely unrelated to this
thread.  I find it difficult to keep the focus on the proposed changes
if we discuss different topics.  I'd be happy to review such a change.
If we channel our energies efficiently it will be easy to work
together.  Please send a diff, even a draft, or your suggestion to me
and/or tech@ and we can start from there.

> All that aside the real question is why the hand-rolled mechanism in the
> first place? I see your "pool allocator" has a per-cpu caching layer,
> and based on that I would expect caching to be implemented on top of it.
>
> If the pool allocator has significant shortcomings they should probably
> get addressed instead of rolling with a dedicated mechanism.

Because they are different allocators for different purposes.  Here we
are dealing with getting whole physical pages; the pool allocator deals
with fixed-size chunks of pages (items).
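
To make that contrast concrete, here is a minimal sketch of the two
layers side by side.  "struct foo", "foopl" and foo_example() are
made-up names; pool_init(9), pool_cache_init(9), pool_get(9) and
uvm_pagealloc() are the stock in-tree APIs:

/*
 * Illustrative only: "struct foo" and "foopl" are made up.
 * (Assumes <sys/pool.h> and <uvm/uvm.h> in a kernel context.)
 */
struct foo { int f_val; };
struct pool foopl;

void
foo_example(void)
{
	struct foo *f;
	struct vm_page *pg;

	/*
	 * pool(9) hands out fixed-size items carved out of pages the
	 * pool grabs from its backend allocator; pool_cache_init(9)
	 * enables its per-CPU item cache.
	 */
	pool_init(&foopl, sizeof(struct foo), 0, IPL_NONE, 0, "foopl",
	    NULL);
	pool_cache_init(&foopl);
	f = pool_get(&foopl, PR_WAITOK);
	pool_put(&foopl, f);

	/*
	 * The uvm page allocator hands out whole physical pages as
	 * struct vm_page; that is the layer the proposed per-CPU
	 * cache sits above.
	 */
	pg = uvm_pagealloc(NULL, 0, NULL, UVM_PLA_ZERO);
	if (pg != NULL)
		uvm_pagefree(pg);
}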
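
For readers without the full diff: uvm_pcpu_getpage() above relies on
uvm_pcpu_fillcache(), which is not quoted here.  Purely as an
illustration of the batching idea (the real diff may do this
differently; UVM_PCPU_BATCH is a made-up constant), a refill could
visit the global allocator once per batch via uvm_pglistalloc(9):

/*
 * Sketch only, not the code from the diff.  UVM_PCPU_BATCH must not
 * exceed the size of upc_pages[].  uvm_pglistalloc(9) lets one call
 * pull a whole batch, so the global allocator's lock is paid once
 * per batch instead of once per page.
 */
int
uvm_pcpu_fillcache(void)
{
	struct uvm_percpu *upc = &curcpu()->ci_uvm;
	struct pglist plist;
	struct vm_page *pg;

	TAILQ_INIT(&plist);
	if (uvm_pglistalloc(UVM_PCPU_BATCH * PAGE_SIZE, 0, (paddr_t)-1,
	    0, 0, &plist, UVM_PCPU_BATCH, UVM_PLA_NOWAIT))
		return 1;	/* batch refill failed */

	while ((pg = TAILQ_FIRST(&plist)) != NULL) {
		TAILQ_REMOVE(&plist, pg, pageq);
		upc->upc_pages[upc->upc_count++] = pg;
	}
	/* keep the global gauge in step with uvm_pcpu_getpage() */
	atomic_add_int(&uvmexp.percpucaches, upc->upc_count);
	return 0;
}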
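
And since remark 1 will come up again: cpu-local stats for the
hand-rolled mechanism would amount to counting in struct uvm_percpu
itself and summing only on read.  A minimal sketch, with hypothetical
upc_hit/upc_miss fields and a made-up uvm_pcpu_readstats() helper
(none of this is in the posted diff):

/*
 * Sketch only.  The fast path needs no atomics because each CPU
 * touches only its own fields; the summed totals are therefore
 * approximate while other CPUs keep counting.
 */
struct uvm_percpu {
	struct vm_page	*upc_pages[UVM_PCPU_NPAGES];	/* assumed name */
	int		 upc_count;
	uint64_t	 upc_hit;	/* this CPU's cache hits */
	uint64_t	 upc_miss;	/* this CPU's cache misses */
};

/*
 * In uvm_pcpu_getpage(), plain increments would replace the
 * atomic_inc_int() calls on uvmexp.pcphit/pcpmiss:
 *	upc->upc_miss++;  ...  upc->upc_hit++;
 */

/* Reading is the slow path (e.g. a sysctl): sum over all CPUs. */
void
uvm_pcpu_readstats(uint64_t *hit, uint64_t *miss)
{
	CPU_INFO_ITERATOR cii;
	struct cpu_info *ci;

	*hit = *miss = 0;
	CPU_INFO_FOREACH(cii, ci) {
		*hit += ci->ci_uvm.upc_hit;
		*miss += ci->ci_uvm.upc_miss;
	}
}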
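
Finally, for reference on remark 2: the suggestion amounts to a second
zeroing routine that uses ordinary cached stores, so a page zeroed at
fault time is still hot when the faulting process touches it.  A rough
sketch, assuming the direct map (__HAVE_PMAP_DIRECT, as on amd64); the
uvm_pagezero_onfault name is taken from Guzik's mail:

/*
 * Sketch only: zero with ordinary (temporal) stores.  The existing
 * non-temporal path suits the background zeroing thread, but at
 * fault time it leaves the page cold for the process that is about
 * to touch it.
 */
void
uvm_pagezero_onfault(struct vm_page *pg)
{
	vaddr_t va = pmap_map_direct(pg);

	memset((void *)va, 0, PAGE_SIZE);
}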