From: David Gwynne
Subject: Re: per-CPU page caches for page faults
To: mpi@openbsd.org
Cc: tech@openbsd.org
Date: Tue, 19 Mar 2024 15:06:12 +1000

On Mon, Mar 18, 2024 at 08:13:43PM +0100, Martin Pieuchot wrote:
> Diff below attaches a 16-page array to "struct cpu_info" and uses it
> as a cache to reduce contention on the global pmemrange mutex.
> 
> Measured performance improvements are between 7% and 13% with 16 CPUs
> and between 19% and 33% with 32 CPUs.  -current OpenBSD doesn't scale
> above 32 CPUs, so it wouldn't be fair to compare numbers with jobs
> spread across more CPUs.  However, as you can see below, this
> limitation is no longer true with this diff.
> 
> kernel
> ------
> 16:  1m47.93s real    11m24.18s user    10m55.78s system
> 32:  2m33.30s real    11m46.08s user    32m32.35s system  (BC cold)
>      2m02.36s real    11m55.12s user    21m40.66s system
> 64:  2m00.72s real    11m59.59s user    25m47.63s system
> 
> libLLVM
> -------
> 16: 30m45.54s real   363m25.35s user   150m34.05s system
> 32: 24m29.88s real   409m49.80s user   311m02.54s system
> 64: 29m22.63s real   404m16.20s user   771m31.26s system
> 80: 30m12.49s real   398m07.01s user   816m01.71s system
> 
> kernel+percpucaches(16)
> ------
> 16:  1m30.17s real    11m19.29s user     6m42.08s system
> 32:  2m02.28s real    11m42.13s user    23m42.64s system  (BC cold)
>      1m22.82s real    11m41.72s user     8m50.12s system
> 64:  1m23.47s real    11m56.99s user     9m42.00s system
> 80:  1m24.63s real    11m44.24s user    10m38.00s system
> 
> libLLVM+percpucaches(16)
> -------
> 16: 28m38.73s real   363m34.69s user    95m45.68s system
> 32: 19m57.71s real   415m17.23s user   174m47.83s system
> 64: 18m59.50s real   450m17.79s user   406m05.42s system
> 80: 19m02.26s real   452m35.11s user   473m09.05s system
> 
> Still, the most important impact of this diff is the reduction in %sys
> time: it drops from ~40% with 16 CPUs and from ~55% with 32 CPUs or
> more.
> 
> What is the idea behind this diff?  With a significant number of CPUs
> (16 or more), grabbing a global mutex for every page allocation & free
> creates a lot of contention, resulting in many CPU cycles wasted in
> system (kernel) time.  The idea of this diff is to add another layer
> on top of the global allocator to allocate and free pages in batches.
> Note that, in this diff, this cache is only used for page faults.
> 
> The number 16 was chosen after careful testing on an 80 CPU Ampere
> machine.  I tried to keep it as small as possible while making sure
> that multiple parallel page faults on a large number of CPUs do not
> result in contention.  I'd argue that "stealing" at most 64k per CPU
> is acceptable on any MP system.
> 
> The diff includes 3 new counters visible in "systat uvm" and
> "vmstat -s".
> 
> When the page daemon kicks in, we drain the cache of the current CPU,
> which is the best we can do without adding too much complexity.
> 
> I have only tested amd64 and arm64; that's why there is such a define
> in uvm/uvm_page.c.  I'd be happy to hear about tests on other
> architectures and different topologies.  You'll need to edit
> $arch/include/cpu.h and modify the define.
> 
> This diff is really interesting because it now allows us to clearly
> see which syscalls are contending a lot.  Unsurprisingly, they are
> kbind(2), munmap(2) and mprotect(2).  It also shows which workloads
> are VFS-bound.  That is what the "Buffer-Cache cold" (BC cold) numbers
> represent above.  With a small number of CPUs we don't see much
> difference between the two.
> 
> Comments?

i like the idea, and i like the improvements.
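
to the question about other archs: i think the hookup is only two
small edits, mirroring what the diff does for amd64 and arm64. a
completely untested sketch, picking powerpc64 as an example (the
__powerpc64__ test and where ci_uvm would sit in that port's struct
cpu_info are my guesses, check your port's cpu.h):

	/* sys/uvm/uvm_page.c: extend the gate to the arch you test */
	#if defined(MULTIPROCESSOR) && \
	    (defined(__amd64__) || defined(__arm64__) || \
	     defined(__powerpc64__))

	/* $arch/include/cpu.h: pull in the cache type... */
	#include <uvm/uvm_percpu.h>

	/* ...and hang a cache off each cpu, as amd64/arm64 do */
	struct cpu_info {
		/* existing members */
	#ifdef MULTIPROCESSOR
		struct uvm_percpu ci_uvm;	/* [o] page cache */
	#endif
		/* ... */
	};
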
this is basically the same problem that jeff bonwick deals with in his
"magazines and vmem" paper about the changes he made to the solaris
slab allocator to make it scale on machines with a bunch of cpus.
that's the reference i used when i implemented per-cpu caches in
pools, and it's probably worth following here as well. the only real
change i'd want you to make is to introduce the "previously loaded
magazine" to mitigate thrashing, as per section 3.1 in the paper;
there's a rough sketch of what i mean at the bottom of this mail.
pretty exciting though.

> Index: usr.bin/systat/uvm.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/systat/uvm.c,v
> diff -u -p -r1.6 uvm.c
> --- usr.bin/systat/uvm.c	27 Nov 2022 23:18:54 -0000	1.6
> +++ usr.bin/systat/uvm.c	12 Mar 2024 16:53:11 -0000
> @@ -80,11 +80,10 @@ struct uvmline uvmline[] = {
>  	{ &uvmexp.zeropages, &last_uvmexp.zeropages, "zeropages",
>  	  &uvmexp.pageins, &last_uvmexp.pageins, "pageins",
>  	  &uvmexp.fltrelckok, &last_uvmexp.fltrelckok, "fltrelckok" },
> -	{ &uvmexp.reserve_pagedaemon, &last_uvmexp.reserve_pagedaemon,
> -	    "reserve_pagedaemon",
> +	{ &uvmexp.percpucaches, &last_uvmexp.percpucaches, "percpucaches",
>  	  &uvmexp.pgswapin, &last_uvmexp.pgswapin, "pgswapin",
>  	  &uvmexp.fltanget, &last_uvmexp.fltanget, "fltanget" },
> -	{ &uvmexp.reserve_kernel, &last_uvmexp.reserve_kernel, "reserve_kernel",
> +	{ NULL, NULL, NULL,
>  	  &uvmexp.pgswapout, &last_uvmexp.pgswapout, "pgswapout",
>  	  &uvmexp.fltanretry, &last_uvmexp.fltanretry, "fltanretry" },
>  	{ NULL, NULL, NULL,
> @@ -143,13 +142,13 @@
>  	  NULL, NULL, NULL },
>  	{ &uvmexp.pagesize, &last_uvmexp.pagesize, "pagesize",
>  	  &uvmexp.pdpending, &last_uvmexp.pdpending, "pdpending",
> -	  NULL, NULL, NULL },
> +	  NULL, NULL, "Per-CPU Counters" },
>  	{ &uvmexp.pagemask, &last_uvmexp.pagemask, "pagemask",
>  	  &uvmexp.pddeact, &last_uvmexp.pddeact, "pddeact",
> -	  NULL, NULL, NULL },
> +	  &uvmexp.pcphit, &last_uvmexp.pcphit, "pcphit" },
>  	{ &uvmexp.pageshift, &last_uvmexp.pageshift, "pageshift",
>  	  NULL, NULL, NULL,
> -	  NULL, NULL, NULL }
> +	  &uvmexp.pcpmiss, &last_uvmexp.pcpmiss, "pcpmiss" }
>  };
> 
>  field_def fields_uvm[] = {
> Index: usr.bin/vmstat/vmstat.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/vmstat/vmstat.c,v
> diff -u -p -r1.155 vmstat.c
> --- usr.bin/vmstat/vmstat.c	4 Dec 2022 23:50:50 -0000	1.155
> +++ usr.bin/vmstat/vmstat.c	12 Mar 2024 16:49:56 -0000
> @@ -513,7 +513,12 @@ dosum(void)
>  		uvmexp.reserve_pagedaemon);
>  	(void)printf("%11u pages reserved for kernel\n",
>  		uvmexp.reserve_kernel);
> +	(void)printf("%11u pages in per-cpu caches\n",
> +		uvmexp.percpucaches);
> 
> +	/* per-cpu cache */
> +	(void)printf("%11u per-cpu cache hits\n", uvmexp.pcphit);
> +	(void)printf("%11u per-cpu cache misses\n", uvmexp.pcpmiss);
>  	/* swap */
>  	(void)printf("%11u swap pages\n", uvmexp.swpages);
>  	(void)printf("%11u swap pages in use\n", uvmexp.swpginuse);
> Index: sys/uvm/uvm.h
> ===================================================================
> RCS file: /cvs/src/sys/uvm/uvm.h,v
> diff -u -p -r1.71 uvm.h
> --- sys/uvm/uvm.h	7 Oct 2022 05:01:44 -0000	1.71
> +++ sys/uvm/uvm.h	6 Mar 2024 15:10:33 -0000
> @@ -47,6 +47,7 @@
>   *
>   * Locks used to protect struct members in this file:
>   *	Q	uvm.pageqlock
> + *	F	uvm.fpageqlock
>   */
>  struct uvm {
>  	/* vm_page related parameters */
> @@ -58,7 +59,7 @@ struct uvm {
>  	struct mutex pageqlock;		/* [] lock for active/inactive page q */
>  	struct mutex fpageqlock;	/* [] lock for free page q + pdaemon */
>  	boolean_t page_init_done;	/* TRUE if uvm_page_init() finished */
> -	struct uvm_pmr_control pmr_control;	/* pmemrange data */
> +	struct uvm_pmr_control pmr_control;	/* [F] pmemrange data */
> 
>  	/* page daemon trigger */
>  	int pagedaemon;			/* daemon sleeps on this */
> Index: sys/uvm/uvm_page.c
> ===================================================================
> RCS file: /cvs/src/sys/uvm/uvm_page.c,v
> diff -u -p -r1.174 uvm_page.c
> --- sys/uvm/uvm_page.c	13 Feb 2024 10:16:28 -0000	1.174
> +++ sys/uvm/uvm_page.c	18 Mar 2024 17:37:17 -0000
> @@ -75,6 +75,7 @@
>  #include
> 
>  #include
> +#include <uvm/uvm_percpu.h>
> 
>  /*
>   * for object trees
>   */
> @@ -119,6 +120,8 @@ static vaddr_t virtual_space_end;
>  static void uvm_pageinsert(struct vm_page *);
>  static void uvm_pageremove(struct vm_page *);
>  int uvm_page_owner_locked_p(struct vm_page *);
> +struct vm_page *uvm_pcpu_getpage(int);
> +int uvm_pcpu_putpage(struct vm_page *);
> 
>  /*
>   * inline functions
>   */
> @@ -869,6 +872,7 @@ uvm_pagerealloc_multi(struct uvm_object
>  	return r;
>  }
> 
> +
>  /*
>   * uvm_pagealloc: allocate vm_page from a particular free list.
>   *
> @@ -877,13 +881,11 @@ uvm_pagerealloc_multi(struct uvm_object
>   * => only one of obj or anon can be non-null
>   * => caller must activate/deactivate page if it is not wired.
>   */
> -
>  struct vm_page *
>  uvm_pagealloc(struct uvm_object *obj, voff_t off, struct vm_anon *anon,
>      int flags)
>  {
> -	struct vm_page *pg;
> -	struct pglist pgl;
> +	struct vm_page *pg = NULL;
>  	int pmr_flags;
> 
>  	KASSERT(obj == NULL || anon == NULL);
> @@ -906,12 +908,18 @@ uvm_pagealloc(struct uvm_object *obj, vo
> 
>  	if (flags & UVM_PGA_ZERO)
>  		pmr_flags |= UVM_PLA_ZERO;
> -	TAILQ_INIT(&pgl);
> -	if (uvm_pmr_getpages(1, 0, 0, 1, 0, 1, pmr_flags, &pgl) != 0)
> -		goto fail;
> 
> -	pg = TAILQ_FIRST(&pgl);
> -	KASSERT(pg != NULL && TAILQ_NEXT(pg, pageq) == NULL);
> +	pg = uvm_pcpu_getpage(pmr_flags);
> +	if (pg == NULL) {
> +		struct pglist pgl;
> +
> +		TAILQ_INIT(&pgl);
> +		if (uvm_pmr_getpages(1, 0, 0, 1, 0, 1, pmr_flags, &pgl) != 0)
> +			goto fail;
> +
> +		pg = TAILQ_FIRST(&pgl);
> +		KASSERT(pg != NULL && TAILQ_NEXT(pg, pageq) == NULL);
> +	}
> 
>  	uvm_pagealloc_pg(pg, obj, off, anon);
>  	KASSERT((pg->pg_flags & PG_DEV) == 0);
> @@ -1025,7 +1033,8 @@ void
>  uvm_pagefree(struct vm_page *pg)
>  {
>  	uvm_pageclean(pg);
> -	uvm_pmr_freepages(pg, 1);
> +	if (uvm_pcpu_putpage(pg) == 0)
> +		uvm_pmr_freepages(pg, 1);
>  }
> 
>  /*
> @@ -1398,3 +1407,120 @@ uvm_pagecount(struct uvm_constraint_rang
>  	}
>  	return sz;
>  }
> +
> +#if defined(MULTIPROCESSOR) && (defined(__amd64__) || defined(__arm64__))
> +
> +/*
> + * uvm_pcpu_fillcache: populate the current CPU's cache
> + */
> +int
> +uvm_pcpu_fillcache(void)
> +{
> +	struct uvm_percpu *upc = &curcpu()->ci_uvm;
> +	struct vm_page *pg;
> +	struct pglist pgl;
> +	int count;
> +
> +	/*
> +	 * Leave one free slot so that the next free doesn't drain
> +	 * the cache.
> +	 */
> +	count = UVM_PCPU_MAXPAGES - upc->upc_count - 1;
> +
> +	TAILQ_INIT(&pgl);
> +	if (uvm_pmr_getpages(count, 0, 0, 1, 0, 1, UVM_PLA_NOWAIT, &pgl))
> +		return -1;
> +
> +	while ((pg = TAILQ_FIRST(&pgl)) != NULL) {
> +		TAILQ_REMOVE(&pgl, pg, pageq);
> +		upc->upc_pages[upc->upc_count] = pg;
> +		upc->upc_count++;
> +	}
> +	atomic_add_int(&uvmexp.percpucaches, count);
> +
> +	return 0;
> +}
> +
> +/*
> + * uvm_pcpu_getpage: allocate a page from the current CPU's cache
> + */
> +struct vm_page *
> +uvm_pcpu_getpage(int flags)
> +{
> +	struct uvm_percpu *upc = &curcpu()->ci_uvm;
> +	struct vm_page *pg;
> +
> +	if (upc->upc_count == 0) {
> +		atomic_inc_int(&uvmexp.pcpmiss);
> +		if (uvm_pcpu_fillcache())
> +			return NULL;
> +	} else {
> +		atomic_inc_int(&uvmexp.pcphit);
> +	}
> +
> +	atomic_dec_int(&uvmexp.percpucaches);
> +	upc->upc_count--;
> +	pg = upc->upc_pages[upc->upc_count];
> +
> +	if (flags & UVM_PLA_ZERO)
> +		uvm_pagezero(pg);
> +
> +	return pg;
> +}
> +
> +/*
> + * uvm_pcpu_draincache: empty the current CPU's cache
> + */
> +void
> +uvm_pcpu_draincache(void)
> +{
> +	struct uvm_percpu *upc = &curcpu()->ci_uvm;
> +	struct pglist pgl;
> +	int i;
> +
> +	TAILQ_INIT(&pgl);
> +	for (i = 0; i < upc->upc_count; i++)
> +		TAILQ_INSERT_TAIL(&pgl, upc->upc_pages[i], pageq);
> +
> +	uvm_pmr_freepageq(&pgl);
> +
> +	atomic_sub_int(&uvmexp.percpucaches, upc->upc_count);
> +	upc->upc_count = 0;
> +	memset(upc->upc_pages, 0, sizeof(upc->upc_pages));
> +}
> +
> +/*
> + * uvm_pcpu_putpage: free a page and place it on the current CPU's cache
> + */
> +int
> +uvm_pcpu_putpage(struct vm_page *pg)
> +{
> +	struct uvm_percpu *upc = &curcpu()->ci_uvm;
> +
> +	if (upc->upc_count >= UVM_PCPU_MAXPAGES)
> +		uvm_pcpu_draincache();
> +
> +	upc->upc_pages[upc->upc_count] = pg;
> +	upc->upc_count++;
> +	atomic_inc_int(&uvmexp.percpucaches);
> +
> +	return 1;
> +}
> +
> +#else /* !MULTIPROCESSOR */
> +
> +struct vm_page *
> +uvm_pcpu_getpage(int flags)
> +{
> +	return NULL;
> +}
> +
> +void
> +uvm_pcpu_draincache(void)
> +{
> +}
> +
> +int
> +uvm_pcpu_putpage(struct vm_page *pg)
> +{
> +	return 0;
> +}
> +
> +#endif /* MULTIPROCESSOR */
> Index: sys/uvm/uvm_pmemrange.c
> ===================================================================
> RCS file: /cvs/src/sys/uvm/uvm_pmemrange.c,v
> diff -u -p -r1.63 uvm_pmemrange.c
> --- sys/uvm/uvm_pmemrange.c	10 Apr 2023 04:21:20 -0000	1.63
> +++ sys/uvm/uvm_pmemrange.c	11 Mar 2024 17:40:27 -0000
> @@ -1352,8 +1352,6 @@ uvm_pmr_freepageq(struct pglist *pgl)
>  	if (uvmexp.zeropages < UVM_PAGEZERO_TARGET)
>  		wakeup(&uvmexp.zeropages);
>  	uvm_unlock_fpageq();
> -
> -	return;
>  }
> 
>  /*
> Index: sys/uvm/uvmexp.h
> ===================================================================
> RCS file: /cvs/src/sys/uvm/uvmexp.h,v
> diff -u -p -r1.11 uvmexp.h
> --- sys/uvm/uvmexp.h	27 Oct 2023 19:18:53 -0000	1.11
> +++ sys/uvm/uvmexp.h	18 Mar 2024 17:35:24 -0000
> @@ -64,7 +64,7 @@ struct uvmexp {
>  	int zeropages;		/* [F] number of zero'd pages */
>  	int reserve_pagedaemon; /* [I] # of pages reserved for pagedaemon */
>  	int reserve_kernel;	/* [I] # of pages reserved for kernel */
> -	int unused01;		/* formerly anonpages */
> +	int percpucaches;	/* [a] # of pages in per-CPU caches */
>  	int vnodepages;		/* XXX # of pages used by vnode page cache */
>  	int vtextpages;		/* XXX # of pages used by vtext vnodes */
> 
> @@ -99,8 +99,8 @@ struct uvmexp {
>  	int syscalls;	/* system calls */
>  	int pageins;	/* pagein operation count */
>  			/* pageouts are in pdpageouts below */
> -	int unused07;	/* formerly obsolete_swapins */
> -	int unused08;	/* formerly obsolete_swapouts */
> +	int pcphit;	/* [a] # of pagealloc from per-CPU cache */
> +	int pcpmiss;	/* [a] # of times a per-CPU cache was empty */
>  	int pgswapin;	/* pages swapped in */
>  	int pgswapout;	/* pages swapped out */
>  	int forks;	/* forks */
> Index: sys/uvm/uvm_pdaemon.c
> ===================================================================
> RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
> diff -u -p -r1.109 uvm_pdaemon.c
> --- sys/uvm/uvm_pdaemon.c	27 Oct 2023 19:18:53 -0000	1.109
> +++ sys/uvm/uvm_pdaemon.c	17 Mar 2024 17:54:02 -0000
> @@ -80,6 +80,7 @@
>  #endif
> 
>  #include
> +#include <uvm/uvm_percpu.h>
> 
>  #include "drm.h"
> 
> @@ -276,6 +277,8 @@ uvm_pageout(void *arg)
>  #if NDRM > 0
>  	drmbackoff(size * 2);
>  #endif
> +	uvm_pcpu_draincache();
> +
>  	uvm_lock_pageq();
> 
>  	/*
> Index: sys/uvm/uvm_percpu.h
> ===================================================================
> RCS file: sys/uvm/uvm_percpu.h
> diff -N sys/uvm/uvm_percpu.h
> --- /dev/null	1 Jan 1970 00:00:00 -0000
> +++ sys/uvm/uvm_percpu.h	17 Mar 2024 17:48:49 -0000
> @@ -0,0 +1,43 @@
> +/*	$OpenBSD$	*/
> +
> +/*
> + * Copyright (c) 2024 Martin Pieuchot
> + *
> + * Permission to use, copy, modify, and distribute this software for any
> + * purpose with or without fee is hereby granted, provided that the above
> + * copyright notice and this permission notice appear in all copies.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
> + * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
> + * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
> + * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
> + * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
> + * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
> + * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
> + */
> +
> +#ifndef _UVM_UVM_PCPU_H_
> +#define _UVM_UVM_PCPU_H_
> +
> +/*
> + * We want the per-CPU cache size to be as small as possible while
> + * still getting rid of the `uvm_lock_fpageq' contention.
> + */
> +#define UVM_PCPU_MAXPAGES	16
> +
> +struct vm_page;
> +
> +/*
> + * Per-CPU page cache
> + *
> + * Locks used to protect struct members in this file:
> + *	o	owned (read/modified only) by its CPU
> + */
> +struct uvm_percpu {
> +	struct vm_page *upc_pages[UVM_PCPU_MAXPAGES];	/* [o] */
> +	int upc_count;			/* [o] # of pages in cache */
> +};
> +
> +void uvm_pcpu_draincache(void);
> +
> +#endif /* _UVM_UVM_PCPU_H_ */
> Index: sys/arch/amd64/include/cpu.h
> ===================================================================
> RCS file: /cvs/src/sys/arch/amd64/include/cpu.h,v
> diff -u -p -r1.163 cpu.h
> --- sys/arch/amd64/include/cpu.h	25 Feb 2024 19:15:50 -0000	1.163
> +++ sys/arch/amd64/include/cpu.h	10 Mar 2024 17:33:55 -0000
> @@ -53,6 +53,7 @@
>  #include
>  #include
>  #include
> +#include <uvm/uvm_percpu.h>
> 
>  #ifdef _KERNEL
> 
> @@ -201,6 +202,7 @@ struct cpu_info {
> 
>  #ifdef MULTIPROCESSOR
>  	struct srp_hazard ci_srp_hazards[SRP_HAZARD_NUM];
> +	struct uvm_percpu ci_uvm;	/* [o] page cache */
>  #endif
> 
>  	struct ksensordev ci_sensordev;
> Index: sys/arch/arm64/include/cpu.h
> ===================================================================
> RCS file: /cvs/src/sys/arch/arm64/include/cpu.h,v
> diff -u -p -r1.43 cpu.h
> --- sys/arch/arm64/include/cpu.h	25 Feb 2024 19:15:50 -0000	1.43
> +++ sys/arch/arm64/include/cpu.h	12 Mar 2024 16:23:37 -0000
> @@ -108,6 +108,7 @@ void arm32_vector_init(vaddr_t, int);
>  #include
>  #include
>  #include
> +#include <uvm/uvm_percpu.h>
> 
>  struct cpu_info {
>  	struct device *ci_dev;		/* Device corresponding to this CPU */
> @@ -161,6 +162,7 @@ struct cpu_info {
> 
>  #ifdef MULTIPROCESSOR
>  	struct srp_hazard ci_srp_hazards[SRP_HAZARD_NUM];
> +	struct uvm_percpu ci_uvm;
>  	volatile int ci_flags;
> 
>  	volatile int ci_ddb_paused;
> 
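
to make the magazine idea concrete, here's a rough and completely
untested sketch of the loaded/previous scheme from section 3.1 of the
paper. the names are made up, i've left out the counters and the
zero/flags handling, and the fallback is still pmemrange like in your
diff. the point is that when the loaded magazine is empty (or full) we
first try swapping it with the previous one, and only fall back to
pmemrange when both are empty (or both full):

	/* a magazine is a fixed-size stack of pages. */
	struct uvm_magazine {
		struct vm_page	*m_pages[UVM_PCPU_MAXPAGES];
		int		 m_count;
	};

	/*
	 * each cpu gets two magazines. upc_loaded starts as
	 * &upc_mags[0] and upc_prev as &upc_mags[1], both empty.
	 */
	struct uvm_percpu {
		struct uvm_magazine	 upc_mags[2];
		struct uvm_magazine	*upc_loaded;	/* alloc/free here */
		struct uvm_magazine	*upc_prev;	/* one swap away */
	};

	struct vm_page *
	uvm_pcpu_getpage(void)
	{
		struct uvm_percpu *upc = &curcpu()->ci_uvm;
		struct uvm_magazine *m = upc->upc_loaded;

		if (m->m_count == 0) {
			/* loaded is empty, try the previous magazine. */
			if (upc->upc_prev->m_count == 0)
				return NULL;	/* fall back to pmemrange */
			upc->upc_loaded = upc->upc_prev;
			upc->upc_prev = m;
			m = upc->upc_loaded;
		}

		return m->m_pages[--m->m_count];
	}

	int
	uvm_pcpu_putpage(struct vm_page *pg)
	{
		struct uvm_percpu *upc = &curcpu()->ci_uvm;
		struct uvm_magazine *m = upc->upc_loaded;

		if (m->m_count == UVM_PCPU_MAXPAGES) {
			/* loaded is full, try the previous magazine. */
			if (upc->upc_prev->m_count == UVM_PCPU_MAXPAGES)
				return 0;	/* fall back to pmemrange */
			upc->upc_loaded = upc->upc_prev;
			upc->upc_prev = m;
			m = upc->upc_loaded;
		}

		m->m_pages[m->m_count++] = pg;
		return 1;
	}

bonwick exchanges whole magazines with a depot instead of falling back
page by page, but even this cut-down version stops a cpu from paying a
pmemrange round trip per page when a workload frees and allocates
right at the boundary of a full cache.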