From: Mark Kettenis
Subject: Re: per-CPU page caches for page faults
To: Martin Pieuchot
Cc: tech@openbsd.org
Date: Mon, 18 Mar 2024 21:32:35 +0100

> Date: Mon, 18 Mar 2024 20:13:43 +0100
> From: Martin Pieuchot
> Content-Type: text/plain; charset=utf-8
> Content-Disposition: inline
>
> Diff below attaches a 16-page array to the "struct cpu_info" and uses it
> as a cache to reduce contention on the global pmemrange mutex.
>
> Measured performance improvements are between 7% and 13% with 16 CPUs
> and between 19% and 33% with 32 CPUs.  -current OpenBSD doesn't scale
> above 32 CPUs so it wouldn't be fair to compare numbers with jobs
> spread across more CPUs.  However, as you can see below, this
> limitation is no longer true with this diff.
>
> kernel
> ------
> 16: 1m47.93s real    11m24.18s user    10m55.78s system
> 32: 2m33.30s real    11m46.08s user    32m32.35s system (BC cold)
>     2m02.36s real    11m55.12s user    21m40.66s system
> 64: 2m00.72s real    11m59.59s user    25m47.63s system
>
> libLLVM
> -------
> 16: 30m45.54s real   363m25.35s user   150m34.05s system
> 32: 24m29.88s real   409m49.80s user   311m02.54s system
> 64: 29m22.63s real   404m16.20s user   771m31.26s system
> 80: 30m12.49s real   398m07.01s user   816m01.71s system
>
> kernel+percpucaches(16)
> ------
> 16: 1m30.17s real    11m19.29s user     6m42.08s system
> 32: 2m02.28s real    11m42.13s user    23m42.64s system (BC cold)
>     1m22.82s real    11m41.72s user     8m50.12s system
> 64: 1m23.47s real    11m56.99s user     9m42.00s system
> 80: 1m24.63s real    11m44.24s user    10m38.00s system
>
> libLLVM+percpucaches(16)
> -------
> 16: 28m38.73s real   363m34.69s user    95m45.68s system
> 32: 19m57.71s real   415m17.23s user   174m47.83s system
> 64: 18m59.50s real   450m17.79s user   406m05.42s system
> 80: 19m02.26s real   452m35.11s user   473m09.05s system
>
> Still the most important impact of this diff is the reduction of %sys
> time.  It drops from ~40% with 16 CPUs and from ~55% with 32 CPUs or more.

Not much improvement on my 10-core M2 Pro system:

kernel
------
1m05.10s real     5m14.56s user     2m53.34s system

kernel+percpucaches(16)
------
1m03.92s real     5m06.53s user     2m42.14s system

but still an improvement.  Guess I should get more cores next time I
upgrade ;).

> What is the idea behind this diff?  With a significant number of CPUs
> (16 or more), grabbing a global mutex for every page allocation & free
> creates a lot of contention, resulting in many CPU cycles wasted in
> system (kernel) time.  The idea of this diff is to add another layer on
> top of the global allocator to allocate and free pages in batches.
> Note that, in this diff, this cache is only used for page faults.
>
> The number 16 has been chosen after careful testing on an 80-CPU Ampere
> machine.  I tried to keep it as small as possible while making sure
> that multiple parallel page faults on a large number of CPUs do not
> result in contention.  I'd argue that "stealing" at most 64k per CPU is
> acceptable on any MP system.

Yes, and you do drain the cache, so if we really need that memory, most
of it will be available.

> The diff includes 3 new counters visible in "systat uvm" and "vmstat -s".
>
> When the page daemon kicks in we drain the cache of the current CPU
> which is the best we can do without adding too much complexity.

There is an interesting trade-off to be made between bunching allocs
and frees and having free pages in the cache.  You opted to use 15 of
the 16 slots for bunching allocs/frees.  Did you explore a more even
split like 8/8?  Mostly just curious.
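
To make that question a bit more concrete, here is roughly the kind of
thing I have in mind for the free side, purely as an illustration:
drain only half of the cache when it fills up instead of emptying it
completely, so some pages stay around for the next faults.
UVM_PCPU_KEEP is a made-up name and the watermark is arbitrary; this is
an untested sketch against the diff below, not a drop-in replacement.

#define UVM_PCPU_KEEP	(UVM_PCPU_MAXPAGES / 2)	/* hypothetical watermark */

int
uvm_pcpu_putpage(struct vm_page *pg)
{
	struct uvm_percpu *upc = &curcpu()->ci_uvm;
	struct pglist pgl;
	int nfree;

	if (upc->upc_count >= UVM_PCPU_MAXPAGES) {
		/* Give half of the pages back, keep the rest cached. */
		nfree = upc->upc_count - UVM_PCPU_KEEP;
		TAILQ_INIT(&pgl);
		while (upc->upc_count > UVM_PCPU_KEEP) {
			upc->upc_count--;
			TAILQ_INSERT_TAIL(&pgl,
			    upc->upc_pages[upc->upc_count], pageq);
		}
		uvm_pmr_freepageq(&pgl);
		atomic_sub_int(&uvmexp.percpucaches, nfree);
	}

	upc->upc_pages[upc->upc_count] = pg;
	upc->upc_count++;
	atomic_inc_int(&uvmexp.percpucaches);

	return 1;
}

The fill side would then only fill up to the same watermark on a miss.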

> I only tested amd64 and arm64, that's why there is such a define in
> uvm/uvm_page.c.  I'd be happy to hear from tests on other architectures
> and different topologies.  You'll need to edit $arch/include/cpu.h and
> modify the define.

Instead of having the defined(__amd64__) || defined(__arm64__), this
should probably be a #define __HAVE_UVM_PCU in <machine/cpu.h>.  Unless
the goal is to convert all architectures that support MULTIPROCESSOR
kernels swiftly.
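
Roughly like this (just a sketch; the exact name doesn't matter, it
simply has to come from the MD header):

/* sys/arch/amd64/include/cpu.h, sys/arch/arm64/include/cpu.h, ... */
#define __HAVE_UVM_PCU

/* sys/uvm/uvm_page.c */
#if defined(MULTIPROCESSOR) && defined(__HAVE_UVM_PCU)
/* per-CPU cache implementation */
#else
/* no-op (or direct pmemrange) implementations */
#endif

That way new architectures opt in by adding one line to their cpu.h
instead of growing the list in uvm_page.c.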

> This diff is really interesting because it now allows us to clearly see
> which syscalls are contending a lot.  Without surprise it's kbind(2),
> munmap(2) and mprotect(2).  It also shows which workloads are VFS-bound.
> That is what the "Buffer-Cache Cold" (BC Cold) numbers represent above.
> With a small number of CPUs we don't see much difference between the two.
>
> Comments?

A few more down below.

> Index: usr.bin/systat/uvm.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/systat/uvm.c,v
> diff -u -p -r1.6 uvm.c
> --- usr.bin/systat/uvm.c	27 Nov 2022 23:18:54 -0000	1.6
> +++ usr.bin/systat/uvm.c	12 Mar 2024 16:53:11 -0000
> @@ -80,11 +80,10 @@ struct uvmline uvmline[] = {
> 	{ &uvmexp.zeropages, &last_uvmexp.zeropages, "zeropages",
> 	  &uvmexp.pageins, &last_uvmexp.pageins, "pageins",
> 	  &uvmexp.fltrelckok, &last_uvmexp.fltrelckok, "fltrelckok" },
> -	{ &uvmexp.reserve_pagedaemon, &last_uvmexp.reserve_pagedaemon,
> -	    "reserve_pagedaemon",
> +	{ &uvmexp.percpucaches, &last_uvmexp.percpucaches, "percpucaches",
> 	  &uvmexp.pgswapin, &last_uvmexp.pgswapin, "pgswapin",
> 	  &uvmexp.fltanget, &last_uvmexp.fltanget, "fltanget" },
> -	{ &uvmexp.reserve_kernel, &last_uvmexp.reserve_kernel, "reserve_kernel",
> +	{ NULL, NULL, NULL,
> 	  &uvmexp.pgswapout, &last_uvmexp.pgswapout, "pgswapout",
> 	  &uvmexp.fltanretry, &last_uvmexp.fltanretry, "fltanretry" },
> 	{ NULL, NULL, NULL,
> @@ -143,13 +142,13 @@ struct uvmline uvmline[] = {
> 	  NULL, NULL, NULL },
> 	{ &uvmexp.pagesize, &last_uvmexp.pagesize, "pagesize",
> 	  &uvmexp.pdpending, &last_uvmexp.pdpending, "pdpending",
> -	  NULL, NULL, NULL },
> +	  NULL, NULL, "Per-CPU Counters" },
> 	{ &uvmexp.pagemask, &last_uvmexp.pagemask, "pagemask",
> 	  &uvmexp.pddeact, &last_uvmexp.pddeact, "pddeact",
> -	  NULL, NULL, NULL },
> +	  &uvmexp.pcphit, &last_uvmexp.pcphit, "pcphit" },
> 	{ &uvmexp.pageshift, &last_uvmexp.pageshift, "pageshift",
> 	  NULL, NULL, NULL,
> -	  NULL, NULL, NULL }
> +	  &uvmexp.pcpmiss, &last_uvmexp.pcpmiss, "pcpmiss" }
> };
>
> field_def fields_uvm[] = {
> Index: usr.bin/vmstat/vmstat.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/vmstat/vmstat.c,v
> diff -u -p -r1.155 vmstat.c
> --- usr.bin/vmstat/vmstat.c	4 Dec 2022 23:50:50 -0000	1.155
> +++ usr.bin/vmstat/vmstat.c	12 Mar 2024 16:49:56 -0000
> @@ -513,7 +513,12 @@ dosum(void)
> 	    uvmexp.reserve_pagedaemon);
> 	(void)printf("%11u pages reserved for kernel\n",
> 	    uvmexp.reserve_kernel);
> +	(void)printf("%11u pages in per-cpu caches\n",
> +	    uvmexp.percpucaches);
>
> +	/* per-cpu cache */
> +	(void)printf("%11u per-cpu cache hits\n", uvmexp.pcphit);
> +	(void)printf("%11u per-cpu cache misses\n", uvmexp.pcpmiss);
> 	/* swap */
> 	(void)printf("%11u swap pages\n", uvmexp.swpages);
> 	(void)printf("%11u swap pages in use\n", uvmexp.swpginuse);
> Index: sys/uvm/uvm.h
> ===================================================================
> RCS file: /cvs/src/sys/uvm/uvm.h,v
> diff -u -p -r1.71 uvm.h
> --- sys/uvm/uvm.h	7 Oct 2022 05:01:44 -0000	1.71
> +++ sys/uvm/uvm.h	6 Mar 2024 15:10:33 -0000
> @@ -47,6 +47,7 @@
> *
> * Locks used to protect struct members in this file:
> *	Q	uvm.pageqlock
> + *	F	uvm.fpageqlock
> */
> struct uvm {
> 	/* vm_page related parameters */
> @@ -58,7 +59,7 @@ struct uvm {
> 	struct mutex pageqlock;		/* [] lock for active/inactive page q */
> 	struct mutex fpageqlock;	/* [] lock for free page q + pdaemon */
> 	boolean_t page_init_done;	/* TRUE if uvm_page_init() finished */
> -	struct uvm_pmr_control pmr_control; /* pmemrange data */
> +	struct uvm_pmr_control pmr_control; /* [F] pmemrange data */
>
> 	/* page daemon trigger */
> 	int pagedaemon;			/* daemon sleeps on this */
> Index: sys/uvm/uvm_page.c
> ===================================================================
> RCS file: /cvs/src/sys/uvm/uvm_page.c,v
> diff -u -p -r1.174 uvm_page.c
> --- sys/uvm/uvm_page.c	13 Feb 2024 10:16:28 -0000	1.174
> +++ sys/uvm/uvm_page.c	18 Mar 2024 17:37:17 -0000
> @@ -75,6 +75,7 @@
> #include
>
> #include
> +#include
>
> /*
> * for object trees
> @@ -119,6 +120,8 @@ static vaddr_t virtual_space_end;
> static void uvm_pageinsert(struct vm_page *);
> static void uvm_pageremove(struct vm_page *);
> int uvm_page_owner_locked_p(struct vm_page *);
> +struct vm_page *uvm_pcpu_getpage(int);
> +int uvm_pcpu_putpage(struct vm_page *);
>
> /*
> * inline functions
> @@ -869,6 +872,7 @@ uvm_pagerealloc_multi(struct uvm_object
> 	return r;
> }
>
> +
> /*
> * uvm_pagealloc: allocate vm_page from a particular free list.
> *
> @@ -877,13 +881,11 @@ uvm_pagerealloc_multi(struct uvm_object
> * => only one of obj or anon can be non-null
> * => caller must activate/deactivate page if it is not wired.
> */
> -

Not sure what the purpose of moving a blank line is ;).

> struct vm_page *
> uvm_pagealloc(struct uvm_object *obj, voff_t off, struct vm_anon *anon,
>     int flags)
> {
> -	struct vm_page *pg;
> -	struct pglist pgl;
> +	struct vm_page *pg = NULL;
> 	int pmr_flags;
>
> 	KASSERT(obj == NULL || anon == NULL);
> @@ -906,12 +908,18 @@ uvm_pagealloc(struct uvm_object *obj, vo
>
> 	if (flags & UVM_PGA_ZERO)
> 		pmr_flags |= UVM_PLA_ZERO;
> -	TAILQ_INIT(&pgl);
> -	if (uvm_pmr_getpages(1, 0, 0, 1, 0, 1, pmr_flags, &pgl) != 0)
> -		goto fail;
>
> -	pg = TAILQ_FIRST(&pgl);
> -	KASSERT(pg != NULL && TAILQ_NEXT(pg, pageq) == NULL);
> +	pg = uvm_pcpu_getpage(pmr_flags);
> +	if (pg == NULL) {
> +		struct pglist pgl;
> +
> +		TAILQ_INIT(&pgl);
> +		if (uvm_pmr_getpages(1, 0, 0, 1, 0, 1, pmr_flags, &pgl) != 0)
> +			goto fail;
> +
> +		pg = TAILQ_FIRST(&pgl);
> +		KASSERT(pg != NULL && TAILQ_NEXT(pg, pageq) == NULL);
> +	}

Hmm, with the per-CPU caches enabled, retrying that uvm_pmr_getpages()
call probably isn't going to magically succeed.

>
> 	uvm_pagealloc_pg(pg, obj, off, anon);
> 	KASSERT((pg->pg_flags & PG_DEV) == 0);
> @@ -1025,7 +1033,8 @@ void
> uvm_pagefree(struct vm_page *pg)
> {
> 	uvm_pageclean(pg);
> -	uvm_pmr_freepages(pg, 1);
> +	if (uvm_pcpu_putpage(pg) == 0)
> +		uvm_pmr_freepages(pg, 1);
> }

And, the way this is done, you take the hit of having both a
conditional and a function call.  So I wonder if it makes sense to have
the no-cache uvm_pcpu_getpage() and uvm_pcpu_putpage() dummy
implementations do the alloc/free.  Probably micro-optimizations, but I
think it'll make the code more readable.

>
> /*
> @@ -1398,3 +1407,120 @@ uvm_pagecount(struct uvm_constraint_rang
> 	}
> 	return sz;
> }
> +
> +#if defined(MULTIPROCESSOR) && (defined(__amd64__) || defined(__arm64__))
> +
> +/*
> + * uvm_pcpu_draincache: populate the current CPU cache
> + */

Wrong function name in the comment!  If you ask me, it is a bit silly
to have the function name in the comment in the first place.  But
that's UVM for you.
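
Coming back to the dummy implementations mentioned above, something
along these lines is what I mean: the no-cache variants just talk to
pmemrange directly, so uvm_pagealloc() and uvm_pagefree() don't need
the extra conditional (untested sketch only):

struct vm_page *
uvm_pcpu_getpage(int pmr_flags)
{
	struct pglist pgl;

	/* No per-CPU cache: grab a single page straight from pmemrange. */
	TAILQ_INIT(&pgl);
	if (uvm_pmr_getpages(1, 0, 0, 1, 0, 1, pmr_flags, &pgl) != 0)
		return NULL;
	return TAILQ_FIRST(&pgl);
}

int
uvm_pcpu_putpage(struct vm_page *pg)
{
	/* No per-CPU cache: hand the page straight back to pmemrange. */
	uvm_pmr_freepages(pg, 1);
	return 1;
}

The cached variants would then need their own single-page fallback when
filling the cache fails, which also ties in with the retry remark above.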

> +int
> +uvm_pcpu_fillcache(void)
> +{
> +	struct uvm_percpu *upc = &curcpu()->ci_uvm;
> +	struct vm_page *pg;
> +	struct pglist pgl;
> +	int count;
> +
> +	/* Leave one free room to not drain the cache at the next free. */
> +	count = UVM_PCPU_MAXPAGES - upc->upc_count - 1;
> +
> +	TAILQ_INIT(&pgl);
> +	if (uvm_pmr_getpages(count, 0, 0, 1, 0, 1, UVM_PLA_NOWAIT, &pgl))
> +		return -1;
> +
> +	while ((pg = TAILQ_FIRST(&pgl)) != NULL) {
> +		TAILQ_REMOVE(&pgl, pg, pageq);
> +		upc->upc_pages[upc->upc_count] = pg;
> +		upc->upc_count++;
> +	}
> +	atomic_add_int(&uvmexp.percpucaches, count);
> +
> +	return 0;
> +}
> +
> +/*
> + * uvm_pcpu_getpage: allocate a page from the current CPU's cache
> + */
> +struct vm_page *
> +uvm_pcpu_getpage(int flags)
> +{
> +	struct uvm_percpu *upc = &curcpu()->ci_uvm;
> +	struct vm_page *pg;
> +
> +	if (upc->upc_count == 0) {
> +		atomic_inc_int(&uvmexp.pcpmiss);
> +		if (uvm_pcpu_fillcache())
> +			return NULL;
> +	} else {
> +		atomic_inc_int(&uvmexp.pcphit);
> +	}
> +
> +	atomic_dec_int(&uvmexp.percpucaches);
> +	upc->upc_count--;
> +	pg = upc->upc_pages[upc->upc_count];
> +
> +	if (flags & UVM_PLA_ZERO)
> +		uvm_pagezero(pg);
> +
> +	return pg;
> +}
> +
> +/*
> + * uvm_pcpu_draincache: empty the current CPU cache
> + */
> +void
> +uvm_pcpu_draincache(void)
> +{
> +	struct uvm_percpu *upc = &curcpu()->ci_uvm;
> +	struct pglist pgl;
> +	int i;
> +
> +	TAILQ_INIT(&pgl);
> +	for (i = 0; i < upc->upc_count; i++)
> +		TAILQ_INSERT_TAIL(&pgl, upc->upc_pages[i], pageq);
> +
> +	uvm_pmr_freepageq(&pgl);
> +
> +	atomic_sub_int(&uvmexp.percpucaches, upc->upc_count);
> +	upc->upc_count = 0;
> +	memset(upc->upc_pages, 0, sizeof(upc->upc_pages));
> +}
> +
> +/*
> + * uvm_pcpu_putpage: free a page and place it on the current CPU's cache
> + */
> +int
> +uvm_pcpu_putpage(struct vm_page *pg)
> +{
> +	struct uvm_percpu *upc = &curcpu()->ci_uvm;
> +
> +	if (upc->upc_count >= UVM_PCPU_MAXPAGES)
> +		uvm_pcpu_draincache();
> +
> +	upc->upc_pages[upc->upc_count] = pg;
> +	upc->upc_count++;
> +	atomic_inc_int(&uvmexp.percpucaches);
> +
> +	return 1;
> +}
> +
> +#else /* !MULTIPROCESSOR */
> +
> +struct vm_page *
> +uvm_pcpu_getpage(int flags)
> +{
> +	return NULL;
> +}
> +
> +void
> +uvm_pcpu_draincache(void)
> +{
> +}
> +
> +int
> +uvm_pcpu_putpage(struct vm_page *pg)
> +{
> +	return 0;
> +}
> +
> +#endif /* MULTIPROCESSOR */
> Index: sys/uvm/uvm_pmemrange.c
> ===================================================================
> RCS file: /cvs/src/sys/uvm/uvm_pmemrange.c,v
> diff -u -p -r1.63 uvm_pmemrange.c
> --- sys/uvm/uvm_pmemrange.c	10 Apr 2023 04:21:20 -0000	1.63
> +++ sys/uvm/uvm_pmemrange.c	11 Mar 2024 17:40:27 -0000
> @@ -1352,8 +1352,6 @@ uvm_pmr_freepageq(struct pglist *pgl)
> 	if (uvmexp.zeropages < UVM_PAGEZERO_TARGET)
> 		wakeup(&uvmexp.zeropages);
> 	uvm_unlock_fpageq();
> -
> -	return;
> }
>
> /*
> Index: sys/uvm/uvmexp.h
> ===================================================================
> RCS file: /cvs/src/sys/uvm/uvmexp.h,v
> diff -u -p -r1.11 uvmexp.h
> --- sys/uvm/uvmexp.h	27 Oct 2023 19:18:53 -0000	1.11
> +++ sys/uvm/uvmexp.h	18 Mar 2024 17:35:24 -0000
> @@ -64,7 +64,7 @@ struct uvmexp {
> 	int zeropages;		/* [F] number of zero'd pages */
> 	int reserve_pagedaemon; /* [I] # of pages reserved for pagedaemon */
> 	int reserve_kernel;	/* [I] # of pages reserved for kernel */
> -	int unused01;		/* formerly anonpages */
> +	int percpucaches;	/* [a] # of pages in per-CPU caches */
> 	int vnodepages;		/* XXX # of pages used by vnode page cache */
> 	int vtextpages;		/* XXX # of pages used by vtext vnodes */
>
> @@ -99,8 +99,8 @@ struct uvmexp {
> 	int syscalls;	/* system calls */
> 	int pageins;	/* pagein operation count */
> 			/* pageouts are in pdpageouts below */
> -	int unused07;	/* formerly obsolete_swapins */
> -	int unused08;	/* formerly obsolete_swapouts */
> +	int pcphit;	/* [a] # of pagealloc from per-CPU cache */
> +	int pcpmiss;	/* [a] # of times a per-CPU cache was empty */
> 	int pgswapin;	/* pages swapped in */
> 	int pgswapout;	/* pages swapped out */
> 	int forks;	/* forks */
> Index: sys/uvm/uvm_pdaemon.c
> ===================================================================
> RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
> diff -u -p -r1.109 uvm_pdaemon.c
> --- sys/uvm/uvm_pdaemon.c	27 Oct 2023 19:18:53 -0000	1.109
> +++ sys/uvm/uvm_pdaemon.c	17 Mar 2024 17:54:02 -0000
> @@ -80,6 +80,7 @@
> #endif
>
> #include
> +#include
>
> #include "drm.h"
>
> @@ -276,6 +277,8 @@ uvm_pageout(void *arg)
> #if NDRM > 0
> 		drmbackoff(size * 2);
> #endif
> +		uvm_pcpu_draincache();
> +
> 		uvm_lock_pageq();
>
> 		/*
> Index: sys/uvm/uvm_percpu.h
> ===================================================================
> RCS file: sys/uvm/uvm_percpu.h
> diff -N sys/uvm/uvm_percpu.h
> --- /dev/null	1 Jan 1970 00:00:00 -0000
> +++ sys/uvm/uvm_percpu.h	17 Mar 2024 17:48:49 -0000
> @@ -0,0 +1,43 @@
> +/*	$OpenBSD$	*/
> +
> +/*
> + * Copyright (c) 2024 Martin Pieuchot
> + *
> + * Permission to use, copy, modify, and distribute this software for any
> + * purpose with or without fee is hereby granted, provided that the above
> + * copyright notice and this permission notice appear in all copies.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
> + * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
> + * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
> + * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
> + * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
> + * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
> + * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
> + */
> +
> +#ifndef _UVM_UVM_PCPU_H_
> +#define _UVM_UVM_PCPU_H_
> +
> +/*
> + * We want a per-CPU cache size to be as small as possible and at the
> + * same time gets rid of the `uvm_lock_fpageq' contention.
> + */
> +#define UVM_PCPU_MAXPAGES	16
> +
> +struct vm_page;
> +
> +/*
> + * Per-CPU page cache
> + *
> + * Locks used to protect struct members in this file:
> + *	o	owned (read/modified only) by its CPU
> + */
> +struct uvm_percpu {
> +	struct vm_page *upc_pages[UVM_PCPU_MAXPAGES];	/* [o] */
> +	int upc_count;			/* [o] # of pages in cache */
> +};
> +
> +void	uvm_pcpu_draincache(void);
> +
> +#endif /* _UVM_UVM_PCPU_H_ */
> Index: sys/arch/amd64/include/cpu.h
> ===================================================================
> RCS file: /cvs/src/sys/arch/amd64/include/cpu.h,v
> diff -u -p -r1.163 cpu.h
> --- sys/arch/amd64/include/cpu.h	25 Feb 2024 19:15:50 -0000	1.163
> +++ sys/arch/amd64/include/cpu.h	10 Mar 2024 17:33:55 -0000
> @@ -53,6 +53,7 @@
> #include
> #include
> #include
> +#include
>
> #ifdef _KERNEL
>
> @@ -201,6 +202,7 @@ struct cpu_info {
>
> #ifdef MULTIPROCESSOR
> 	struct srp_hazard ci_srp_hazards[SRP_HAZARD_NUM];
> +	struct uvm_percpu ci_uvm;	/* [o] page cache */
> #endif
>
> 	struct ksensordev ci_sensordev;
> Index: sys/arch/arm64/include/cpu.h
> ===================================================================
> RCS file: /cvs/src/sys/arch/arm64/include/cpu.h,v
> diff -u -p -r1.43 cpu.h
> --- sys/arch/arm64/include/cpu.h	25 Feb 2024 19:15:50 -0000	1.43
> +++ sys/arch/arm64/include/cpu.h	12 Mar 2024 16:23:37 -0000
> @@ -108,6 +108,7 @@ void arm32_vector_init(vaddr_t, int);
> #include
> #include
> #include
> +#include
>
> struct cpu_info {
> 	struct device *ci_dev;		/* Device corresponding to this CPU */
> @@ -161,6 +162,7 @@ struct cpu_info {
>
> #ifdef MULTIPROCESSOR
> 	struct srp_hazard ci_srp_hazards[SRP_HAZARD_NUM];
> +	struct uvm_percpu ci_uvm;
> 	volatile int ci_flags;
>
> 	volatile int ci_ddb_paused;
>
>