
From: Mateusz Guzik <mjguzik@gmail.com>
Subject: Re: per-CPU page caches for page faults
To: Martin Pieuchot <mpi@openbsd.org>
Cc: tech@openbsd.org
Date: Thu, 21 Mar 2024 11:52:17 +0100

On Mon, Mar 18, 2024 at 07:13:43PM +0000, Martin Pieuchot wrote:
> What is the idea behind this diff?  With a significant number of CPUs (16
> or more) grabbing a global mutex for every page allocation & free creates
> a lot of contention resulting in many CPU cycles wasted in system (kernel)
> time.  The idea of this diff is to add another layer on top of the global
> allocator to allocate and free pages in batch.  Note that, in this diff,
> this cache is only used for page faults.
> 
> +/*
> + * uvm_pcpu_getpage: allocate a page from the current CPU's cache
> + */
> +struct vm_page *
> +uvm_pcpu_getpage(int flags)
> +{
> +	struct uvm_percpu *upc = &curcpu()->ci_uvm;
> +	struct vm_page *pg;
> +
> +	if (upc->upc_count == 0) {
> +		atomic_inc_int(&uvmexp.pcpmiss);
> +		if (uvm_pcpu_fillcache())
> +			return NULL;
> +	} else {
> +		atomic_inc_int(&uvmexp.pcphit);
> +	}
> +
> +	atomic_dec_int(&uvmexp.percpucaches);
> +	upc->upc_count--;
> +	pg = upc->upc_pages[upc->upc_count];
> +
> +	if (flags & UVM_PLA_ZERO)
> +		uvm_pagezero(pg);
> +
> +	return pg;
> +}
> +

First, two minor remarks:

1. maintaining stats in a global struct avoidably hurts single-threaded
performance (the atomics) and scalability (the cacheline bounces). the
hand-rolled mechanism should keep its own cpu-local stats instead; see
the first sketch after these remarks.

2. uvm_pagezero on amd64 is implemented with non-temporal stores because
of the dedicated kernel thread for background page zeroing. while I
consider the existence of such a thread to be long obsolete, I'm going
to ignore that aspect here. the key point is that nt stores right before
direct use of the page only result in cache misses later on (even more
so if you have a LIFO allocation policy and the page was mostly in L3).
instead a uvm_pagezero_onfault or similar could be added to zero the
page "normally"; see the second sketch after these remarks.
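For remark 1, a minimal sketch of what I mean, assuming hypothetical
upc_pcphit/upc_pcpmiss fields were added to struct uvm_percpu (they are
not in the diff) and aggregation only happens when someone reads the
counters (CPU_INFO_FOREACH or whatever fits):

/*
 * Sketch only: hit/miss counters kept per CPU, bumped with plain
 * stores on the fault path.  upc_pcphit/upc_pcpmiss are hypothetical
 * fields; the global uvmexp counters would be derived on demand.
 */
struct vm_page *
uvm_pcpu_getpage(int flags)
{
	struct uvm_percpu *upc = &curcpu()->ci_uvm;
	struct vm_page *pg;

	if (upc->upc_count == 0) {
		upc->upc_pcpmiss++;	/* cpu-local, no atomic */
		if (uvm_pcpu_fillcache())
			return NULL;
	} else {
		upc->upc_pcphit++;	/* cpu-local, no atomic */
	}

	upc->upc_count--;
	pg = upc->upc_pages[upc->upc_count];

	if (flags & UVM_PLA_ZERO)
		uvm_pagezero(pg);

	return pg;
}

/* Aggregate only when asked for, e.g. from a sysctl handler. */
uint64_t
uvm_pcpu_misses(void)
{
	CPU_INFO_ITERATOR cii;
	struct cpu_info *ci;
	uint64_t total = 0;

	CPU_INFO_FOREACH(cii, ci)
		total += ci->ci_uvm.upc_pcpmiss;
	return total;
}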
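For remark 2, a rough sketch of the kind of helper I mean. kva_of_page()
is a made-up placeholder for however the platform gets a kernel-virtual
address for the page (e.g. the direct map); it is not an existing UVM
interface:

/*
 * Sketch only: zero a page with regular cached stores so the lines are
 * warm for the fault handler that is about to touch them, instead of
 * the non-temporal path used for background zeroing.
 */
void
uvm_pagezero_onfault(struct vm_page *pg)
{
	void *va = kva_of_page(pg);	/* placeholder direct-map lookup */

	memset(va, 0, PAGE_SIZE);
}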

All that aside, the real question is why a hand-rolled mechanism in the
first place? I see your pool allocator already has a per-CPU caching
layer, and based on that I would expect this caching to be implemented
on top of it.

If the pool allocator has significant shortcomings, they should probably
get addressed instead of rolling a dedicated mechanism. A rough sketch
of the usage pattern I have in mind is below.
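This is only a sketch of how pool(9)'s per-CPU caching is normally
enabled and used; whether UVM fault pages can actually be funneled
through a pool like this (rather than handed out as struct vm_page) is
exactly the open question, and the pool name and wrappers are made up:

/*
 * Sketch only: lean on pool(9)'s existing per-CPU caches instead of a
 * hand-rolled layer.
 */
struct pool uvm_fault_page_pool;

void
uvm_fault_pool_init(void)
{
	pool_init(&uvm_fault_page_pool, PAGE_SIZE, PAGE_SIZE, IPL_VM, 0,
	    "faultpg", NULL);
	pool_cache_init(&uvm_fault_page_pool);	/* enable per-CPU caching */
}

void *
uvm_fault_page_get(void)
{
	return pool_get(&uvm_fault_page_pool, PR_NOWAIT | PR_ZERO);
}

void
uvm_fault_page_put(void *pg)
{
	pool_put(&uvm_fault_page_pool, pg);
}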