per-CPU page caches for page faults
> Date: Mon, 18 Mar 2024 20:13:43 +0100
> From: Martin Pieuchot <mpi@openbsd.org>
> Content-Type: text/plain; charset=utf-8
> Content-Disposition: inline
>
> Diff below attaches a 16 page array to the "struct cpuinfo" and uses it
> as a cache to reduce contention on the global pmemrange mutex.
>
> Measured performance improvements are between 7% and 13% with 16 CPUs
> and between 19% and 33% with 32 CPUs. -current OpenBSD doesn't scale
> above 32 CPUs, so it wouldn't be fair to compare runs with jobs spread
> across more CPUs. However, as you can see below, this limitation no
> longer holds with this diff.
>
> kernel
> ------
> 16: 1m47.93s real 11m24.18s user 10m55.78s system
> 32: 2m33.30s real 11m46.08s user 32m32.35s system (BC cold)
> 2m02.36s real 11m55.12s user 21m40.66s system
> 64: 2m00.72s real 11m59.59s user 25m47.63s system
>
> libLLVM
> -------
> 16: 30m45.54s real 363m25.35s user 150m34.05s system
> 32: 24m29.88s real 409m49.80s user 311m02.54s system
> 64: 29m22.63s real 404m16.20s user 771m31.26s system
> 80: 30m12.49s real 398m07.01s user 816m01.71s system
>
> kernel+percpucaches(16)
> ------
> 16: 1m30.17s real 11m19.29s user 6m42.08s system
> 32: 2m02.28s real 11m42.13s user 23m42.64s system (BC cold)
> 1m22.82s real 11m41.72s user 8m50.12s system
> 64: 1m23.47s real 11m56.99s user 9m42.00s system
> 80: 1m24.63s real 11m44.24s user 10m38.00s system
>
> libLLVM+percpucaches(16)
> -------
> 16: 28m38.73s real 363m34.69s user 95m45.68s system
> 32: 19m57.71s real 415m17.23s user 174m47.83s system
> 64: 18m59.50s real 450m17.79s user 406m05.42s system
> 80: 19m02.26s real 452m35.11s user 473m09.05s system
>
> Still, the most important impact of this diff is the reduction in %sys
> time: it drops from ~40% with 16 CPUs and from ~55% with 32 CPUs or more.
Not much improvement on my 10-core M2 Pro system:
kernel
------
1m05.10s real 5m14.56s user 2m53.34s system
kernel+percpucaches(16)
------
1m03.92s real 5m06.53s user 2m42.14s system
but still an improvement. Guess I should get more cores next time I
upgrade ;).
> What is the idea behind this diff? With a substantial number of CPUs (16
> or more), grabbing a global mutex for every page allocation & free creates
> a lot of contention, resulting in many CPU cycles wasted in system (kernel)
> time. The idea of this diff is to add another layer on top of the global
> allocator to allocate and free pages in batch. Note that, in this diff,
> this cache is only used for page faults.
>
> The number 16 was chosen after careful testing on an 80 CPU Ampere
> machine. I tried to keep it as small as possible while making sure that
> multiple parallel page faults on a large number of CPUs do not result in
> contention. I'd argue that "stealing" at most 64k per CPU is acceptable
> on any MP system.
Yes, and you do drain the cache, so if we really need that memory,
most of it will be available.
> The diff includes 3 new counters visible in "systat uvm" and "vmstat -s".
>
> When the page daemon kicks in, we drain the cache of the current CPU,
> which is the best we can do without adding too much complexity.
There is an interesting trade-off to be made between bunching allocs
and frees and having free pages in the cache. You opted to use 15 of
the 16 slots for bunching allocs/frees. Did you explore a more even
split like 8/8? Mostly just curious.
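For the archive, an 8/8 split would only change the fill target in
uvm_pcpu_fillcache(); a rough sketch of what I mean (untested, the
UVM_PCPU_FILLTARGET name is made up):

/*
 * Hypothetical 8/8 split: fill the cache to half capacity only,
 * keeping the other half as room to batch frees.
 */
#define UVM_PCPU_FILLTARGET	(UVM_PCPU_MAXPAGES / 2)

	/* in uvm_pcpu_fillcache(): */
	count = UVM_PCPU_FILLTARGET - upc->upc_count;
	if (count <= 0)
		return 0;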
> I only tested amd64 and arm64, that's why there is such a define in
> uvm/uvm_page.c. I'd be happy to hear about tests on other architectures
> and different topologies. You'll need to edit $arch/include/cpu.h and
> modify the define.
Instead of having the defined(__amd64__) || defined(__arm64__), this
should probably be a #define __HAVE_UVM_PCU in <machine/cpu.h>.
Unless the goal is to swiftly convert all architectures that support
MULTIPROCESSOR kernels.
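Something along these lines (sketch):

/* $arch/include/cpu.h, on each architecture that has been tested: */
#define __HAVE_UVM_PCU

/* uvm/uvm_page.c: */
#if defined(MULTIPROCESSOR) && defined(__HAVE_UVM_PCU)
/* per-CPU cache implementation */
#else
/* dummy implementation */
#endif

That way a new architecture opts in by defining the macro next to its
struct cpu_info instead of growing the list in uvm_page.c.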
> This diff is really interesting because it now allows us to clearly see
> which syscalls are contending a lot. Unsurprisingly it's kbind(2),
> munmap(2) and mprotect(2). It also shows which workloads are VFS-bound.
> That is what the "Buffer-Cache Cold" (BC Cold) numbers represent above.
> With a small number of CPUs we don't see much difference between the two.
>
> Comments?
A few more down below.
> Index: usr.bin/systat/uvm.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/systat/uvm.c,v
> diff -u -p -r1.6 uvm.c
> --- usr.bin/systat/uvm.c 27 Nov 2022 23:18:54 -0000 1.6
> +++ usr.bin/systat/uvm.c 12 Mar 2024 16:53:11 -0000
> @@ -80,11 +80,10 @@ struct uvmline uvmline[] = {
> { &uvmexp.zeropages, &last_uvmexp.zeropages, "zeropages",
> &uvmexp.pageins, &last_uvmexp.pageins, "pageins",
> &uvmexp.fltrelckok, &last_uvmexp.fltrelckok, "fltrelckok" },
> - { &uvmexp.reserve_pagedaemon, &last_uvmexp.reserve_pagedaemon,
> - "reserve_pagedaemon",
> + { &uvmexp.percpucaches, &last_uvmexp.percpucaches, "percpucaches",
> &uvmexp.pgswapin, &last_uvmexp.pgswapin, "pgswapin",
> &uvmexp.fltanget, &last_uvmexp.fltanget, "fltanget" },
> - { &uvmexp.reserve_kernel, &last_uvmexp.reserve_kernel, "reserve_kernel",
> + { NULL, NULL, NULL,
> &uvmexp.pgswapout, &last_uvmexp.pgswapout, "pgswapout",
> &uvmexp.fltanretry, &last_uvmexp.fltanretry, "fltanretry" },
> { NULL, NULL, NULL,
> @@ -143,13 +142,13 @@ struct uvmline uvmline[] = {
> NULL, NULL, NULL },
> { &uvmexp.pagesize, &last_uvmexp.pagesize, "pagesize",
> &uvmexp.pdpending, &last_uvmexp.pdpending, "pdpending",
> - NULL, NULL, NULL },
> + NULL, NULL, "Per-CPU Counters" },
> { &uvmexp.pagemask, &last_uvmexp.pagemask, "pagemask",
> &uvmexp.pddeact, &last_uvmexp.pddeact, "pddeact",
> - NULL, NULL, NULL },
> + &uvmexp.pcphit, &last_uvmexp.pcphit, "pcphit" },
> { &uvmexp.pageshift, &last_uvmexp.pageshift, "pageshift",
> NULL, NULL, NULL,
> - NULL, NULL, NULL }
> + &uvmexp.pcpmiss, &last_uvmexp.pcpmiss, "pcpmiss" }
> };
>
> field_def fields_uvm[] = {
> Index: usr.bin/vmstat/vmstat.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/vmstat/vmstat.c,v
> diff -u -p -r1.155 vmstat.c
> --- usr.bin/vmstat/vmstat.c 4 Dec 2022 23:50:50 -0000 1.155
> +++ usr.bin/vmstat/vmstat.c 12 Mar 2024 16:49:56 -0000
> @@ -513,7 +513,12 @@ dosum(void)
> uvmexp.reserve_pagedaemon);
> (void)printf("%11u pages reserved for kernel\n",
> uvmexp.reserve_kernel);
> + (void)printf("%11u pages in per-cpu caches\n",
> + uvmexp.percpucaches);
>
> + /* per-cpu cache */
> + (void)printf("%11u per-cpu cache hits\n", uvmexp.pcphit);
> + (void)printf("%11u per-cpu cache misses\n", uvmexp.pcpmiss);
> /* swap */
> (void)printf("%11u swap pages\n", uvmexp.swpages);
> (void)printf("%11u swap pages in use\n", uvmexp.swpginuse);
> Index: sys/uvm/uvm.h
> ===================================================================
> RCS file: /cvs/src/sys/uvm/uvm.h,v
> diff -u -p -r1.71 uvm.h
> --- sys/uvm/uvm.h 7 Oct 2022 05:01:44 -0000 1.71
> +++ sys/uvm/uvm.h 6 Mar 2024 15:10:33 -0000
> @@ -47,6 +47,7 @@
> *
> * Locks used to protect struct members in this file:
> * Q uvm.pageqlock
> + * F uvm.fpageqlock
> */
> struct uvm {
> /* vm_page related parameters */
> @@ -58,7 +59,7 @@ struct uvm {
> struct mutex pageqlock; /* [] lock for active/inactive page q */
> struct mutex fpageqlock; /* [] lock for free page q + pdaemon */
> boolean_t page_init_done; /* TRUE if uvm_page_init() finished */
> - struct uvm_pmr_control pmr_control; /* pmemrange data */
> + struct uvm_pmr_control pmr_control; /* [F] pmemrange data */
>
> /* page daemon trigger */
> int pagedaemon; /* daemon sleeps on this */
> Index: sys/uvm/uvm_page.c
> ===================================================================
> RCS file: /cvs/src/sys/uvm/uvm_page.c,v
> diff -u -p -r1.174 uvm_page.c
> --- sys/uvm/uvm_page.c 13 Feb 2024 10:16:28 -0000 1.174
> +++ sys/uvm/uvm_page.c 18 Mar 2024 17:37:17 -0000
> @@ -75,6 +75,7 @@
> #include <sys/smr.h>
>
> #include <uvm/uvm.h>
> +#include <uvm/uvm_percpu.h>
>
> /*
> * for object trees
> @@ -119,6 +120,8 @@ static vaddr_t virtual_space_end;
> static void uvm_pageinsert(struct vm_page *);
> static void uvm_pageremove(struct vm_page *);
> int uvm_page_owner_locked_p(struct vm_page *);
> +struct vm_page *uvm_pcpu_getpage(int);
> +int uvm_pcpu_putpage(struct vm_page *);
>
> /*
> * inline functions
> @@ -869,6 +872,7 @@ uvm_pagerealloc_multi(struct uvm_object
> return r;
> }
>
> +
> /*
> * uvm_pagealloc: allocate vm_page from a particular free list.
> *
> @@ -877,13 +881,11 @@ uvm_pagerealloc_multi(struct uvm_object
> * => only one of obj or anon can be non-null
> * => caller must activate/deactivate page if it is not wired.
> */
> -
Not sure what the purpose of moving a blank line is ;).
> struct vm_page *
> uvm_pagealloc(struct uvm_object *obj, voff_t off, struct vm_anon *anon,
> int flags)
> {
> - struct vm_page *pg;
> - struct pglist pgl;
> + struct vm_page *pg = NULL;
> int pmr_flags;
>
> KASSERT(obj == NULL || anon == NULL);
> @@ -906,12 +908,18 @@ uvm_pagealloc(struct uvm_object *obj, vo
>
> if (flags & UVM_PGA_ZERO)
> pmr_flags |= UVM_PLA_ZERO;
> - TAILQ_INIT(&pgl);
> - if (uvm_pmr_getpages(1, 0, 0, 1, 0, 1, pmr_flags, &pgl) != 0)
> - goto fail;
>
> - pg = TAILQ_FIRST(&pgl);
> - KASSERT(pg != NULL && TAILQ_NEXT(pg, pageq) == NULL);
> + pg = uvm_pcpu_getpage(pmr_flags);
> + if (pg == NULL) {
> + struct pglist pgl;
> +
> + TAILQ_INIT(&pgl);
> + if (uvm_pmr_getpages(1, 0, 0, 1, 0, 1, pmr_flags, &pgl) != 0)
> + goto fail;
> +
> + pg = TAILQ_FIRST(&pgl);
> + KASSERT(pg != NULL && TAILQ_NEXT(pg, pageq) == NULL);
> + }
Hmm, with the per-CPU caches enabled, that uvm_pmr_getpages() call
probably isn't going to magically succeed when the one in
uvm_pcpu_fillcache() just failed.
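Perhaps uvm_pcpu_fillcache() could degrade to smaller batches before
giving up, so this fallback only runs when memory is genuinely tight.
A sketch (untested) of just the allocation part:

	/* in uvm_pcpu_fillcache(), replacing the single attempt */
	count = UVM_PCPU_MAXPAGES - upc->upc_count - 1;
	TAILQ_INIT(&pgl);
	/* Halve the request until it can be satisfied; give up at 0. */
	while (uvm_pmr_getpages(count, 0, 0, 1, 0, 1, UVM_PLA_NOWAIT, &pgl)) {
		count /= 2;
		if (count == 0)
			return -1;
	}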
>
> uvm_pagealloc_pg(pg, obj, off, anon);
> KASSERT((pg->pg_flags & PG_DEV) == 0);
> @@ -1025,7 +1033,8 @@ void
> uvm_pagefree(struct vm_page *pg)
> {
> uvm_pageclean(pg);
> - uvm_pmr_freepages(pg, 1);
> + if (uvm_pcpu_putpage(pg) == 0)
> + uvm_pmr_freepages(pg, 1);
> }
And, the way this is done, you take the hit of having both a
conditional and a function call. So I wonder if it makes sense to have
the no-cache uvm_pcpu_getpage() and uvm_pcpu_putpage() dummy
implementations do the alloc/free. Probably a micro-optimization, but I
think it'll make the code more readable.
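I.e. let the !MULTIPROCESSOR stubs talk to pmemrange directly so the
callers can use them unconditionally; a sketch, reusing the arguments
uvm_pagealloc() and uvm_pagefree() pass today:

struct vm_page *
uvm_pcpu_getpage(int flags)
{
	struct pglist pgl;

	/* No cache: a plain single-page allocation. */
	TAILQ_INIT(&pgl);
	if (uvm_pmr_getpages(1, 0, 0, 1, 0, 1, flags, &pgl) != 0)
		return NULL;
	return TAILQ_FIRST(&pgl);
}

int
uvm_pcpu_putpage(struct vm_page *pg)
{
	/* No cache: free straight to pmemrange. */
	uvm_pmr_freepages(pg, 1);
	return 1;
}

Then uvm_pagealloc() and uvm_pagefree() call these unconditionally and
the extra conditional disappears.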
>
> /*
> @@ -1398,3 +1407,120 @@ uvm_pagecount(struct uvm_constraint_rang
> }
> return sz;
> }
> +
> +#if defined(MULTIPROCESSOR) && (defined(__amd64__) || defined(__arm64__))
> +
> +/*
> + * uvm_pcpu_draincache: populate the current CPU cache
> + */
Wrong function name in the comment! If you ask me, it is a bit silly
to have the function name in the comment in the first place. But
that's UVM for you.
> +int
> +uvm_pcpu_fillcache(void)
> +{
> + struct uvm_percpu *upc = &curcpu()->ci_uvm;
> + struct vm_page *pg;
> + struct pglist pgl;
> + int count;
> +
> + /* Leave one free room to not drain the cache at the next free. */
> + count = UVM_PCPU_MAXPAGES - upc->upc_count - 1;
> +
> + TAILQ_INIT(&pgl);
> + if (uvm_pmr_getpages(count, 0, 0, 1, 0, 1, UVM_PLA_NOWAIT, &pgl))
> + return -1;
> +
> + while ((pg = TAILQ_FIRST(&pgl)) != NULL) {
> + TAILQ_REMOVE(&pgl, pg, pageq);
> + upc->upc_pages[upc->upc_count] = pg;
> + upc->upc_count++;
> + }
> + atomic_add_int(&uvmexp.percpucaches, count);
> +
> + return 0;
> +}
> +
> +/*
> + * uvm_pcpu_getpage: allocate a page from the current CPU's cache
> + */
> +struct vm_page *
> +uvm_pcpu_getpage(int flags)
> +{
> + struct uvm_percpu *upc = &curcpu()->ci_uvm;
> + struct vm_page *pg;
> +
> + if (upc->upc_count == 0) {
> + atomic_inc_int(&uvmexp.pcpmiss);
> + if (uvm_pcpu_fillcache())
> + return NULL;
> + } else {
> + atomic_inc_int(&uvmexp.pcphit);
> + }
> +
> + atomic_dec_int(&uvmexp.percpucaches);
> + upc->upc_count--;
> + pg = upc->upc_pages[upc->upc_count];
> +
> + if (flags & UVM_PLA_ZERO)
> + uvm_pagezero(pg);
> +
> + return pg;
> +}
> +
> +/*
> + * uvm_pcpu_draincache: empty the current CPU cache
> + */
> +void
> +uvm_pcpu_draincache(void)
> +{
> + struct uvm_percpu *upc = &curcpu()->ci_uvm;
> + struct pglist pgl;
> + int i;
> +
> + TAILQ_INIT(&pgl);
> + for (i = 0; i < upc->upc_count; i++)
> + TAILQ_INSERT_TAIL(&pgl, upc->upc_pages[i], pageq);
> +
> + uvm_pmr_freepageq(&pgl);
> +
> + atomic_sub_int(&uvmexp.percpucaches, upc->upc_count);
> + upc->upc_count = 0;
> + memset(upc->upc_pages, 0, sizeof(upc->upc_pages));
> +}
> +
> +/*
> + * uvm_pcpu_putpage: free a page and place it on the current CPU's cache
> + */
> +int
> +uvm_pcpu_putpage(struct vm_page *pg)
> +{
> + struct uvm_percpu *upc = &curcpu()->ci_uvm;
> +
> + if (upc->upc_count >= UVM_PCPU_MAXPAGES)
> + uvm_pcpu_draincache();
> +
> + upc->upc_pages[upc->upc_count] = pg;
> + upc->upc_count++;
> + atomic_inc_int(&uvmexp.percpucaches);
> +
> + return 1;
> +}
> +
> +#else /* !MULTIPROCESSOR */
> +
> +struct vm_page *
> +uvm_pcpu_getpage(int flags)
> +{
> + return NULL;
> +}
> +
> +void
> +uvm_pcpu_draincache(void)
> +{
> +}
> +
> +int
> +uvm_pcpu_putpage(struct vm_page *pg)
> +{
> + return 0;
> +}
> +
> +#endif /* MULTIPROCESSOR */
> Index: sys/uvm/uvm_pmemrange.c
> ===================================================================
> RCS file: /cvs/src/sys/uvm/uvm_pmemrange.c,v
> diff -u -p -r1.63 uvm_pmemrange.c
> --- sys/uvm/uvm_pmemrange.c 10 Apr 2023 04:21:20 -0000 1.63
> +++ sys/uvm/uvm_pmemrange.c 11 Mar 2024 17:40:27 -0000
> @@ -1352,8 +1352,6 @@ uvm_pmr_freepageq(struct pglist *pgl)
> if (uvmexp.zeropages < UVM_PAGEZERO_TARGET)
> wakeup(&uvmexp.zeropages);
> uvm_unlock_fpageq();
> -
> - return;
> }
>
> /*
> Index: sys/uvm/uvmexp.h
> ===================================================================
> RCS file: /cvs/src/sys/uvm/uvmexp.h,v
> diff -u -p -r1.11 uvmexp.h
> --- sys/uvm/uvmexp.h 27 Oct 2023 19:18:53 -0000 1.11
> +++ sys/uvm/uvmexp.h 18 Mar 2024 17:35:24 -0000
> @@ -64,7 +64,7 @@ struct uvmexp {
> int zeropages; /* [F] number of zero'd pages */
> int reserve_pagedaemon; /* [I] # of pages reserved for pagedaemon */
> int reserve_kernel; /* [I] # of pages reserved for kernel */
> - int unused01; /* formerly anonpages */
> + int percpucaches; /* [a] # of pages in per-CPU caches */
> int vnodepages; /* XXX # of pages used by vnode page cache */
> int vtextpages; /* XXX # of pages used by vtext vnodes */
>
> @@ -99,8 +99,8 @@ struct uvmexp {
> int syscalls; /* system calls */
> int pageins; /* pagein operation count */
> /* pageouts are in pdpageouts below */
> - int unused07; /* formerly obsolete_swapins */
> - int unused08; /* formerly obsolete_swapouts */
> + int pcphit; /* [a] # of pagealloc from per-CPU cache */
> + int pcpmiss; /* [a] # of times a per-CPU cache was empty */
> int pgswapin; /* pages swapped in */
> int pgswapout; /* pages swapped out */
> int forks; /* forks */
> Index: sys/uvm/uvm_pdaemon.c
> ===================================================================
> RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
> diff -u -p -r1.109 uvm_pdaemon.c
> --- sys/uvm/uvm_pdaemon.c 27 Oct 2023 19:18:53 -0000 1.109
> +++ sys/uvm/uvm_pdaemon.c 17 Mar 2024 17:54:02 -0000
> @@ -80,6 +80,7 @@
> #endif
>
> #include <uvm/uvm.h>
> +#include <uvm/uvm_percpu.h>
>
> #include "drm.h"
>
> @@ -276,6 +277,8 @@ uvm_pageout(void *arg)
> #if NDRM > 0
> drmbackoff(size * 2);
> #endif
> + uvm_pcpu_draincache();
> +
> uvm_lock_pageq();
>
> /*
> Index: sys/uvm/uvm_percpu.h
> ===================================================================
> RCS file: sys/uvm/uvm_percpu.h
> diff -N sys/uvm/uvm_percpu.h
> --- /dev/null 1 Jan 1970 00:00:00 -0000
> +++ sys/uvm/uvm_percpu.h 17 Mar 2024 17:48:49 -0000
> @@ -0,0 +1,43 @@
> +/* $OpenBSD$ */
> +
> +/*
> + * Copyright (c) 2024 Martin Pieuchot <mpi@openbsd.org>
> + *
> + * Permission to use, copy, modify, and distribute this software for any
> + * purpose with or without fee is hereby granted, provided that the above
> + * copyright notice and this permission notice appear in all copies.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
> + * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
> + * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
> + * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
> + * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
> + * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
> + * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
> + */
> +
> +#ifndef _UVM_UVM_PCPU_H_
> +#define _UVM_UVM_PCPU_H_
> +
> +/*
> + * We want a per-CPU cache size to be as small as possible and at the
> + * same time gets rid of the `uvm_lock_fpageq' contention.
> + */
> +#define UVM_PCPU_MAXPAGES 16
> +
> +struct vm_page;
> +
> +/*
> + * Per-CPU page cache
> + *
> + * Locks used to protect struct members in this file:
> + * o owned (read/modified only) by its CPU
> + */
> +struct uvm_percpu {
> + struct vm_page *upc_pages[UVM_PCPU_MAXPAGES]; /* [o] */
> + int upc_count; /* [o] # of pages in cache */
> +};
> +
> +void uvm_pcpu_draincache(void);
> +
> +#endif /* _UVM_UVM_PCPU_H_ */
> Index: sys/arch/amd64/include/cpu.h
> ===================================================================
> RCS file: /cvs/src/sys/arch/amd64/include/cpu.h,v
> diff -u -p -r1.163 cpu.h
> --- sys/arch/amd64/include/cpu.h 25 Feb 2024 19:15:50 -0000 1.163
> +++ sys/arch/amd64/include/cpu.h 10 Mar 2024 17:33:55 -0000
> @@ -53,6 +53,7 @@
> #include <sys/sched.h>
> #include <sys/sensors.h>
> #include <sys/srp.h>
> +#include <uvm/uvm_percpu.h>
>
> #ifdef _KERNEL
>
> @@ -201,6 +202,7 @@ struct cpu_info {
>
> #ifdef MULTIPROCESSOR
> struct srp_hazard ci_srp_hazards[SRP_HAZARD_NUM];
> + struct uvm_percpu ci_uvm; /* [o] page cache */
> #endif
>
> struct ksensordev ci_sensordev;
> Index: sys/arch/arm64/include/cpu.h
> ===================================================================
> RCS file: /cvs/src/sys/arch/arm64/include/cpu.h,v
> diff -u -p -r1.43 cpu.h
> --- sys/arch/arm64/include/cpu.h 25 Feb 2024 19:15:50 -0000 1.43
> +++ sys/arch/arm64/include/cpu.h 12 Mar 2024 16:23:37 -0000
> @@ -108,6 +108,7 @@ void arm32_vector_init(vaddr_t, int);
> #include <sys/device.h>
> #include <sys/sched.h>
> #include <sys/srp.h>
> +#include <uvm/uvm_percpu.h>
>
> struct cpu_info {
> struct device *ci_dev; /* Device corresponding to this CPU */
> @@ -161,6 +162,7 @@ struct cpu_info {
>
> #ifdef MULTIPROCESSOR
> struct srp_hazard ci_srp_hazards[SRP_HAZARD_NUM];
> + struct uvm_percpu ci_uvm;
> volatile int ci_flags;
>
> volatile int ci_ddb_paused;
>
>