From: Martin Pieuchot
Subject: Re: per-CPU page caches for page faults
To: tech@openbsd.org
Date: Sun, 31 Mar 2024 19:00:30 +0200

On 19/03/24(Tue) 15:06, David Gwynne wrote:
> On Mon, Mar 18, 2024 at 08:13:43PM +0100, Martin Pieuchot wrote:
> > Diff below attaches a 16-page array to "struct cpu_info" and uses it
> > as a cache to reduce contention on the global pmemrange mutex.
> >
> > Measured performance improvements are between 7% and 13% with 16 CPUs
> > and between 19% and 33% with 32 CPUs.  -current OpenBSD doesn't scale
> > above 32 CPUs so it wouldn't be fair to compare runs with jobs spread
> > across more CPUs.  However, as you can see below, this limitation is
> > no longer true with this diff.
> >
> > kernel
> > ------
> > 16: 1m47.93s real    11m24.18s user    10m55.78s system
> > 32: 2m33.30s real    11m46.08s user    32m32.35s system (BC cold)
> >     2m02.36s real    11m55.12s user    21m40.66s system
> > 64: 2m00.72s real    11m59.59s user    25m47.63s system
> >
> > libLLVM
> > -------
> > 16: 30m45.54s real   363m25.35s user   150m34.05s system
> > 32: 24m29.88s real   409m49.80s user   311m02.54s system
> > 64: 29m22.63s real   404m16.20s user   771m31.26s system
> > 80: 30m12.49s real   398m07.01s user   816m01.71s system
> >
> > kernel+percpucaches(16)
> > ------
> > 16: 1m30.17s real    11m19.29s user     6m42.08s system
> > 32: 2m02.28s real    11m42.13s user    23m42.64s system (BC cold)
> >     1m22.82s real    11m41.72s user     8m50.12s system
> > 64: 1m23.47s real    11m56.99s user     9m42.00s system
> > 80: 1m24.63s real    11m44.24s user    10m38.00s system
> >
> > libLLVM+percpucaches(16)
> > -------
> > 16: 28m38.73s real   363m34.69s user    95m45.68s system
> > 32: 19m57.71s real   415m17.23s user   174m47.83s system
> > 64: 18m59.50s real   450m17.79s user   406m05.42s system
> > 80: 19m02.26s real   452m35.11s user   473m09.05s system
> >
> > Still, the most important impact of this diff is the reduction of
> > %sys time: it drops from ~40% with 16 CPUs, and from ~55% with 32
> > CPUs or more.
> >
> > What is the idea behind this diff?  With a significant number of CPUs
> > (16 or more), grabbing a global mutex for every page allocation & free
> > creates a lot of contention, resulting in many CPU cycles wasted in
> > system (kernel) time.  The idea of this diff is to add another layer
> > on top of the global allocator to allocate and free pages in batch.
> > Note that, in this diff, this cache is only used for page faults.
> >
> > The number of 16 has been chosen after careful testing on an 80-CPU
> > Ampere machine.  I tried to keep it as small as possible while making
> > sure that multiple parallel page faults on a large number of CPUs do
> > not result in contention.  I'd argue that "stealing" at most 64k per
> > CPU is acceptable on any MP system.
> >
> > The diff includes 3 new counters visible in "systat uvm" and
> > "vmstat -s".
> >
> > When the page daemon kicks in we drain the cache of the current CPU,
> > which is the best we can do without adding too much complexity.
> >
> > I only tested amd64 and arm64, that's why there is such a define in
> > uvm/uvm_page.c.  I'd be happy to hear from tests on other
> > architectures and different topologies.  You'll need to edit
> > $arch/include/cpu.h and modify the define.
> >
> > This diff is really interesting because it now allows us to clearly
> > see which syscalls are contending a lot.  Without surprise it's
> > kbind(2), munmap(2) and mprotect(2).  It also shows which workloads
> > are VFS-bound.  That is what the "Buffer-Cache Cold" (BC cold)
> > numbers represent above.
> > With a small number of CPUs we don't see much difference between
> > the two.
> >
> > Comments?
>
> i like the idea, and i like the improvements.
>
> this is basically the same problem that jeff bonwick deals with in
> his magazines and vmem paper about the changes he made to the solaris
> slab allocator to make it scale on machines with a bunch of cpus.
> that's the reference i used when i implemented per cpu caches in
> pools, and it's probably worth following here as well. the only
> real change i'd want you to make is to introduce the "previously
> loaded magazine" to mitigate thrashing as per section 3.1 in the
> paper.
>
> pretty exciting though.

New version that should address all previous comments:

- Use 2 magazines of 8 pages each and imitate the pool_cache code (see
  the sketch below).  The miss/hit ratio can be observed to be 1/8 with
  "systat uvm".

- Ensure that uvm_pmr_getpages() won't fail on highly fragmented
  memory, and do not wake up the pagedaemon if it fails to fully
  reload a magazine.

- Use __HAVE_UVM_PERCPU & provide UP versions of cache_get/cache_put().

- Change amap_wipeout() to call uvm_anfree() to fill the cache instead
  of bypassing it by calling uvm_pglistfree().

- Include a fix for incorrect decrementing of `uvm.swpgonly' in
  uvm_anon_release() (this should be committed independently).

I didn't do any measurements with this version but robert@ said it
shaves off 30 minutes compared to the previous one for a chromium
build w/ 32 CPUs (from 4.5h down to 4h).

Comments?  Tests?
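
Before diving into the diff, here is a minimal user-space sketch of the
two-magazine scheme from section 3.1 of Bonwick's paper.  All names in
it are illustrative stand-ins, not the diff's own: backend_get() and
backend_put() play the role of uvm_pmr_getpages()/uvm_pmr_freepages(),
and the real uvm_pmr_cache_get() below additionally tries to reload a
full magazine before falling back to the global allocator.

	/*
	 * Illustrative model of a per-CPU two-magazine cache
	 * (after Bonwick).  backend_get()/backend_put() stand in
	 * for the global page allocator.
	 */
	#include <stdlib.h>

	#define MAGSZ	8		/* objects per magazine */

	struct magazine {
		void	*objs[MAGSZ];
		int	 nobjs;
	};

	struct percpu_cache {
		struct magazine	mag[2];	/* loaded + previously loaded */
		int		actv;	/* index of the loaded magazine */
	};

	static void *
	backend_get(void)
	{
		return malloc(1);	/* stand-in for uvm_pmr_getpages() */
	}

	static void
	backend_put(void *o)
	{
		free(o);		/* stand-in for uvm_pmr_freepages() */
	}

	void *
	cache_get(struct percpu_cache *c)
	{
		struct magazine *m = &c->mag[c->actv];

		if (m->nobjs == 0) {
			int prev = !c->actv;

			/* Loaded magazine is empty, try the previous one. */
			m = &c->mag[prev];
			if (m->nobjs == 0)
				return backend_get();	/* double miss */
			c->actv = prev;			/* swap magazines */
		}
		return m->objs[--m->nobjs];
	}

	void
	cache_put(struct percpu_cache *c, void *o)
	{
		struct magazine *m = &c->mag[c->actv];

		if (m->nobjs == MAGSZ) {
			int prev = !c->actv;

			/* Loaded magazine is full: flush the other one
			 * back to the backend, then load it. */
			m = &c->mag[prev];
			while (m->nobjs > 0)
				backend_put(m->objs[--m->nobjs]);
			c->actv = prev;			/* swap magazines */
		}
		m->objs[m->nobjs++] = o;
	}

Keeping two magazines means a CPU alternating a single allocation with
a single free flips between the loaded and previously loaded magazine
instead of going back to the global allocator on every call, which is
exactly the thrashing dgwynne refers to.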
Index: usr.bin/systat/uvm.c
===================================================================
RCS file: /cvs/src/usr.bin/systat/uvm.c,v
diff -u -p -r1.6 uvm.c
--- usr.bin/systat/uvm.c	27 Nov 2022 23:18:54 -0000	1.6
+++ usr.bin/systat/uvm.c	29 Mar 2024 20:56:32 -0000
@@ -80,11 +80,10 @@ struct uvmline uvmline[] = {
 	{ &uvmexp.zeropages, &last_uvmexp.zeropages, "zeropages",
 	  &uvmexp.pageins, &last_uvmexp.pageins, "pageins",
 	  &uvmexp.fltrelckok, &last_uvmexp.fltrelckok, "fltrelckok" },
-	{ &uvmexp.reserve_pagedaemon, &last_uvmexp.reserve_pagedaemon,
-	  "reserve_pagedaemon",
+	{ &uvmexp.percpucaches, &last_uvmexp.percpucaches, "percpucaches",
 	  &uvmexp.pgswapin, &last_uvmexp.pgswapin, "pgswapin",
 	  &uvmexp.fltanget, &last_uvmexp.fltanget, "fltanget" },
-	{ &uvmexp.reserve_kernel, &last_uvmexp.reserve_kernel, "reserve_kernel",
+	{ NULL, NULL, NULL,
 	  &uvmexp.pgswapout, &last_uvmexp.pgswapout, "pgswapout",
 	  &uvmexp.fltanretry, &last_uvmexp.fltanretry, "fltanretry" },
 	{ NULL, NULL, NULL,
@@ -143,13 +142,13 @@ struct uvmline uvmline[] = {
 	  NULL, NULL, NULL },
 	{ &uvmexp.pagesize, &last_uvmexp.pagesize, "pagesize",
 	  &uvmexp.pdpending, &last_uvmexp.pdpending, "pdpending",
-	  NULL, NULL, NULL },
+	  NULL, NULL, "Per-CPU Counters" },
 	{ &uvmexp.pagemask, &last_uvmexp.pagemask, "pagemask",
 	  &uvmexp.pddeact, &last_uvmexp.pddeact, "pddeact",
-	  NULL, NULL, NULL },
+	  &uvmexp.pcphit, &last_uvmexp.pcphit, "pcphit" },
 	{ &uvmexp.pageshift, &last_uvmexp.pageshift, "pageshift",
 	  NULL, NULL, NULL,
-	  NULL, NULL, NULL }
+	  &uvmexp.pcpmiss, &last_uvmexp.pcpmiss, "pcpmiss" }
 };
 
 field_def fields_uvm[] = {
Index: usr.bin/vmstat/vmstat.c
===================================================================
RCS file: /cvs/src/usr.bin/vmstat/vmstat.c,v
diff -u -p -r1.155 vmstat.c
--- usr.bin/vmstat/vmstat.c	4 Dec 2022 23:50:50 -0000	1.155
+++ usr.bin/vmstat/vmstat.c	29 Mar 2024 20:56:32 -0000
@@ -513,7 +513,12 @@ dosum(void)
 	    uvmexp.reserve_pagedaemon);
 	(void)printf("%11u pages reserved for kernel\n",
 	    uvmexp.reserve_kernel);
+	(void)printf("%11u pages in per-cpu caches\n",
+	    uvmexp.percpucaches);
 
+	/* per-cpu cache */
+	(void)printf("%11u per-cpu cache hits\n", uvmexp.pcphit);
+	(void)printf("%11u per-cpu cache misses\n", uvmexp.pcpmiss);
 	/* swap */
 	(void)printf("%11u swap pages\n", uvmexp.swpages);
 	(void)printf("%11u swap pages in use\n", uvmexp.swpginuse);
Index: sys/uvm/uvm_amap.c
===================================================================
RCS file: /cvs/src/sys/uvm/uvm_amap.c,v
diff -u -p -r1.92 uvm_amap.c
--- sys/uvm/uvm_amap.c	11 Apr 2023 00:45:09 -0000	1.92
+++ sys/uvm/uvm_amap.c	30 Mar 2024 17:30:10 -0000
@@ -482,7 +482,6 @@ amap_wipeout(struct vm_amap *amap)
 	int slot;
 	struct vm_anon *anon;
 	struct vm_amap_chunk *chunk;
-	struct pglist pgl;
 
 	KASSERT(rw_write_held(amap->am_lock));
 	KASSERT(amap->am_ref == 0);
@@ -495,7 +494,6 @@ amap_wipeout(struct vm_amap *amap)
 		return;
 	}
 
-	TAILQ_INIT(&pgl);
 	amap_list_remove(amap);
 
 	AMAP_CHUNK_FOREACH(chunk, amap) {
@@ -515,12 +513,10 @@ amap_wipeout(struct vm_amap *amap)
 			 */
 			refs = --anon->an_ref;
 			if (refs == 0) {
-				uvm_anfree_list(anon, &pgl);
+				uvm_anfree(anon);
 			}
 		}
 	}
 
-	/* free the pages */
-	uvm_pglistfree(&pgl);
 	/*
 	 * Finally, destroy the amap.
Index: sys/uvm/uvm_anon.c
===================================================================
RCS file: /cvs/src/sys/uvm/uvm_anon.c,v
diff -u -p -r1.57 uvm_anon.c
--- sys/uvm/uvm_anon.c	27 Oct 2023 19:13:51 -0000	1.57
+++ sys/uvm/uvm_anon.c	30 Mar 2024 09:21:19 -0000
@@ -116,7 +116,7 @@ uvm_anfree_list(struct vm_anon *anon, st
 			uvm_unlock_pageq();	/* free the daemon */
 		}
 	} else {
-		if (anon->an_swslot != 0 && anon->an_swslot != SWSLOT_BAD) {
+		if (anon->an_swslot > 0) {
 			/* This page is no longer only in swap. */
 			KASSERT(uvmexp.swpgonly > 0);
 			atomic_dec_int(&uvmexp.swpgonly);
@@ -260,7 +260,8 @@ uvm_anon_release(struct vm_anon *anon)
 	uvm_unlock_pageq();
 	KASSERT(anon->an_page == NULL);
 	lock = anon->an_lock;
-	uvm_anfree(anon);
+	uvm_anon_dropswap(anon);
+	pool_put(&uvm_anon_pool, anon);
 	rw_exit(lock);
 	/* Note: extra reference is held for PG_RELEASED case. */
 	rw_obj_free(lock);
Index: sys/uvm/uvm_page.c
===================================================================
RCS file: /cvs/src/sys/uvm/uvm_page.c,v
diff -u -p -r1.174 uvm_page.c
--- sys/uvm/uvm_page.c	13 Feb 2024 10:16:28 -0000	1.174
+++ sys/uvm/uvm_page.c	31 Mar 2024 12:16:46 -0000
@@ -75,6 +75,7 @@
 #include <uvm/uvm.h>
 #include <uvm/uvm_ddb.h>
+#include <uvm/uvm_percpu.h>
 
 /*
  * for object trees
 */
@@ -120,6 +121,10 @@ static void uvm_pageinsert(struct vm_pag
 static void uvm_pageremove(struct vm_page *);
 int uvm_page_owner_locked_p(struct vm_page *);
 
+struct vm_page	*uvm_pmr_getone(int);
+struct vm_page	*uvm_pmr_cache_get(int);
+void		 uvm_pmr_cache_put(struct vm_page *);
+
 /*
  * inline functions
  */
@@ -877,13 +882,11 @@ uvm_pagerealloc_multi(struct uvm_object
  * => only one of obj or anon can be non-null
 * => caller must activate/deactivate page if it is not wired.
  */
-
 struct vm_page *
 uvm_pagealloc(struct uvm_object *obj, voff_t off, struct vm_anon *anon,
     int flags)
 {
-	struct vm_page *pg;
-	struct pglist pgl;
+	struct vm_page *pg = NULL;
 	int pmr_flags;
 
 	KASSERT(obj == NULL || anon == NULL);
@@ -906,13 +909,10 @@ uvm_pagealloc(struct uvm_object *obj, vo
 	if (flags & UVM_PGA_ZERO)
 		pmr_flags |= UVM_PLA_ZERO;
 
-	TAILQ_INIT(&pgl);
-	if (uvm_pmr_getpages(1, 0, 0, 1, 0, 1, pmr_flags, &pgl) != 0)
-		goto fail;
-
-	pg = TAILQ_FIRST(&pgl);
-	KASSERT(pg != NULL && TAILQ_NEXT(pg, pageq) == NULL);
+	pg = uvm_pmr_cache_get(pmr_flags);
+	if (pg == NULL)
+		return NULL;
 
 	uvm_pagealloc_pg(pg, obj, off, anon);
 	KASSERT((pg->pg_flags & PG_DEV) == 0);
 	if (flags & UVM_PGA_ZERO)
@@ -921,9 +921,6 @@ uvm_pagealloc(struct uvm_object *obj, vo
 		atomic_setbits_int(&pg->pg_flags, PG_CLEAN);
 
 	return pg;
-
-fail:
-	return NULL;
 }
 
 /*
@@ -1025,7 +1022,7 @@ void
 uvm_pagefree(struct vm_page *pg)
 {
 	uvm_pageclean(pg);
-	uvm_pmr_freepages(pg, 1);
+	uvm_pmr_cache_put(pg);
 }
 
 /*
@@ -1398,3 +1395,153 @@ uvm_pagecount(struct uvm_constraint_rang
 	}
 	return sz;
 }
+
+struct vm_page *
+uvm_pmr_getone(int flags)
+{
+	struct vm_page *pg;
+	struct pglist pgl;
+
+	TAILQ_INIT(&pgl);
+	if (uvm_pmr_getpages(1, 0, 0, 1, 0, 1, flags, &pgl) != 0)
+		return NULL;
+
+	pg = TAILQ_FIRST(&pgl);
+	KASSERT(pg != NULL && TAILQ_NEXT(pg, pageq) == NULL);
+
+	return pg;
+}
+
+#if defined(MULTIPROCESSOR) && defined(__HAVE_UVM_PERCPU)
+
+/*
+ * Reload a magazine.
+ */
+int
+uvm_pmr_cache_alloc(struct uvm_pmr_cache_item *upci)
+{
+	struct vm_page *pg;
+	struct pglist pgl;
+	int flags = UVM_PLA_NOWAIT|UVM_PLA_NOWAKE;
+	int npages = UVM_PMR_CACHEMAGSZ;
+
+	KASSERT(upci->upci_npages == 0);
+
+	TAILQ_INIT(&pgl);
+	if (uvm_pmr_getpages(npages, 0, 0, 1, 0, npages, flags, &pgl))
+		return -1;
+
+	while ((pg = TAILQ_FIRST(&pgl)) != NULL) {
+		TAILQ_REMOVE(&pgl, pg, pageq);
+		upci->upci_pages[upci->upci_npages] = pg;
+		upci->upci_npages++;
+	}
+	atomic_add_int(&uvmexp.percpucaches, npages);
+
+	return 0;
+}
+
+struct vm_page *
+uvm_pmr_cache_get(int flags)
+{
+	struct uvm_pmr_cache *upc = &curcpu()->ci_uvm;
+	struct uvm_pmr_cache_item *upci;
+	struct vm_page *pg;
+
+	upci = &upc->upc_magz[upc->upc_actv];
+	if (upci->upci_npages == 0) {
+		unsigned int prev;
+
+		prev = (upc->upc_actv == 0) ? 1 : 0;
+		upci = &upc->upc_magz[prev];
+		if (upci->upci_npages == 0) {
+			atomic_inc_int(&uvmexp.pcpmiss);
+			if (uvm_pmr_cache_alloc(upci))
+				return uvm_pmr_getone(flags);
+		}
+		/* Swap magazines */
+		upc->upc_actv = prev;
+	} else {
+		atomic_inc_int(&uvmexp.pcphit);
+	}
+
+	atomic_dec_int(&uvmexp.percpucaches);
+	upci->upci_npages--;
+	pg = upci->upci_pages[upci->upci_npages];
+
+	if (flags & UVM_PLA_ZERO)
+		uvm_pagezero(pg);
+
+	return pg;
+}
+
+void
+uvm_pmr_cache_free(struct uvm_pmr_cache_item *upci)
+{
+	struct pglist pgl;
+	int i;
+
+	TAILQ_INIT(&pgl);
+	for (i = 0; i < upci->upci_npages; i++)
+		TAILQ_INSERT_TAIL(&pgl, upci->upci_pages[i], pageq);
+
+	uvm_pmr_freepageq(&pgl);
+
+	atomic_sub_int(&uvmexp.percpucaches, upci->upci_npages);
+	upci->upci_npages = 0;
+	memset(upci->upci_pages, 0, sizeof(upci->upci_pages));
+}
+
+void
+uvm_pmr_cache_put(struct vm_page *pg)
+{
+	struct uvm_pmr_cache *upc = &curcpu()->ci_uvm;
+	struct uvm_pmr_cache_item *upci;
+
+	upci = &upc->upc_magz[upc->upc_actv];
+	if (upci->upci_npages >= UVM_PMR_CACHEMAGSZ) {
+		unsigned int prev;
+
+		prev = (upc->upc_actv == 0) ? 1 : 0;
+		upci = &upc->upc_magz[prev];
+		if (upci->upci_npages > 0)
+			uvm_pmr_cache_free(upci);
+
+		/* Swap magazines */
+		upc->upc_actv = prev;
+		KASSERT(upci->upci_npages == 0);
+	}
+
+	upci->upci_pages[upci->upci_npages] = pg;
+	upci->upci_npages++;
+	atomic_inc_int(&uvmexp.percpucaches);
+}
+
+void
+uvm_pmr_cache_drain(void)
+{
+	struct uvm_pmr_cache *upc = &curcpu()->ci_uvm;
+
+	uvm_pmr_cache_free(&upc->upc_magz[0]);
+	uvm_pmr_cache_free(&upc->upc_magz[1]);
+}
+
+#else /* !(MULTIPROCESSOR && __HAVE_UVM_PERCPU) */
+
+struct vm_page *
+uvm_pmr_cache_get(int flags)
+{
+	return uvm_pmr_getone(flags);
+}
+
+void
+uvm_pmr_cache_put(struct vm_page *pg)
+{
+	uvm_pmr_freepages(pg, 1);
+}
+
+void
+uvm_pmr_cache_drain(void)
+{
+}
+#endif /* MULTIPROCESSOR */
Index: sys/uvm/uvm_pdaemon.c
===================================================================
RCS file: /cvs/src/sys/uvm/uvm_pdaemon.c,v
diff -u -p -r1.110 uvm_pdaemon.c
--- sys/uvm/uvm_pdaemon.c	24 Mar 2024 10:29:35 -0000	1.110
+++ sys/uvm/uvm_pdaemon.c	30 Mar 2024 12:53:39 -0000
@@ -80,6 +80,7 @@
 #endif
 
 #include <uvm/uvm.h>
+#include <uvm/uvm_percpu.h>
 
 #include "drm.h"
 
@@ -262,6 +263,8 @@ uvm_pageout(void *arg)
 #if NDRM > 0
 		drmbackoff(size * 2);
 #endif
+		uvm_pmr_cache_drain();
+
 		/*
 		 * scan if needed
 		 */
Index: sys/uvm/uvm_percpu.h
===================================================================
RCS file: sys/uvm/uvm_percpu.h
diff -N sys/uvm/uvm_percpu.h
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ sys/uvm/uvm_percpu.h	30 Mar 2024 12:54:47 -0000
@@ -0,0 +1,45 @@
+/*	$OpenBSD$	*/
+
+/*
+ * Copyright (c) 2024 Martin Pieuchot
+ *
+ * Permission to use, copy, modify, and distribute this software for any
+ * purpose with or without fee is hereby granted, provided that the above
+ * copyright notice and this permission notice appear in all copies.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+ * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+ * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+ * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+ * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+ * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+ * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+ */
+
+#ifndef _UVM_UVM_PCPU_H_
+#define _UVM_UVM_PCPU_H_
+
+/*
+ * We want the per-CPU cache to be as small as possible while still
+ * getting rid of the `uvm_lock_fpageq' contention.
+ */
+#define UVM_PMR_CACHEMAGSZ	8	/* # of pages in a magazine */
+
+struct vm_page;
+
+/* Magazine */
+struct uvm_pmr_cache_item {
+	struct vm_page	*upci_pages[UVM_PMR_CACHEMAGSZ];
+	int		 upci_npages;	/* # of pages in magazine */
+};
+
+/* Per-CPU cache */
+struct uvm_pmr_cache {
+	struct uvm_pmr_cache_item	upc_magz[2];	/* magazines */
+	int				upc_actv; /* index of active magazine */
+
+};
+
+void	uvm_pmr_cache_drain(void);
+
+#endif /* _UVM_UVM_PCPU_H_ */
Index: sys/uvm/uvmexp.h
===================================================================
RCS file: /cvs/src/sys/uvm/uvmexp.h,v
diff -u -p -r1.12 uvmexp.h
--- sys/uvm/uvmexp.h	24 Mar 2024 10:29:35 -0000	1.12
+++ sys/uvm/uvmexp.h	29 Mar 2024 21:04:16 -0000
@@ -66,7 +66,7 @@ struct uvmexp {
 	int zeropages;		/* [F] number of zero'd pages */
 	int reserve_pagedaemon; /* [I] # of pages reserved for pagedaemon */
 	int reserve_kernel;	/* [I] # of pages reserved for kernel */
-	int unused01;		/* formerly anonpages */
+	int percpucaches;	/* [a] # of pages in per-CPU caches */
 	int vnodepages;		/* XXX # of pages used by vnode page cache */
 	int vtextpages;		/* XXX # of pages used by vtext vnodes */
 
@@ -101,8 +101,8 @@ struct uvmexp {
 	int syscalls;	/* system calls */
 	int pageins;	/* [p] pagein operation count */
 			/* pageouts are in pdpageouts below */
-	int unused07;	/* formerly obsolete_swapins */
-	int unused08;	/* formerly obsolete_swapouts */
+	int pcphit;	/* [a] # of pagealloc from per-CPU cache */
+	int pcpmiss;	/* [a] # of times a per-CPU cache was empty */
 	int pgswapin;	/* pages swapped in */
 	int pgswapout;	/* pages swapped out */
 	int forks;	/* forks */
Index: sys/arch/amd64/include/cpu.h
===================================================================
RCS file: /cvs/src/sys/arch/amd64/include/cpu.h,v
diff -u -p -r1.163 cpu.h
--- sys/arch/amd64/include/cpu.h	25 Feb 2024 19:15:50 -0000	1.163
+++ sys/arch/amd64/include/cpu.h	30 Mar 2024 12:55:27 -0000
@@ -53,6 +53,7 @@
 #include <sys/sched.h>
 #include <sys/sensors.h>
 #include <sys/srp.h>
+#include <uvm/uvm_percpu.h>
 
 #ifdef _KERNEL
 
@@ -201,6 +202,8 @@ struct cpu_info {
 
 #ifdef MULTIPROCESSOR
 	struct srp_hazard ci_srp_hazards[SRP_HAZARD_NUM];
+#define __HAVE_UVM_PERCPU
+	struct uvm_pmr_cache	ci_uvm;		/* [o] page cache */
 #endif
 
 	struct ksensordev ci_sensordev;
Index: sys/arch/arm64/include/cpu.h
===================================================================
RCS file: /cvs/src/sys/arch/arm64/include/cpu.h,v
diff -u -p -r1.43 cpu.h
--- sys/arch/arm64/include/cpu.h	25 Feb 2024 19:15:50 -0000	1.43
+++ sys/arch/arm64/include/cpu.h	30 Mar 2024 12:55:55 -0000
@@ -108,6 +108,7 @@ void arm32_vector_init(vaddr_t, int);
 #include <sys/device.h>
 #include <sys/sched.h>
 #include <sys/srp.h>
+#include <uvm/uvm_percpu.h>
 
 struct cpu_info {
 	struct device *ci_dev;	/* Device corresponding to this CPU */
@@ -161,6 +162,8 @@ struct cpu_info {
 
 #ifdef MULTIPROCESSOR
 	struct srp_hazard ci_srp_hazards[SRP_HAZARD_NUM];
+#define __HAVE_UVM_PERCPU
+	struct uvm_pmr_cache	ci_uvm;
 	volatile int ci_flags;
 
 	volatile int ci_ddb_paused;
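
For tests on other architectures: the opt-in is the same two lines in
$arch/include/cpu.h as in the amd64/arm64 hunks above.  A hypothetical
sketch (not a real hunk; member placement and the locking annotation
are per-arch):

	/* $arch/include/cpu.h -- hypothetical example */
	#include <uvm/uvm_percpu.h>	/* for struct uvm_pmr_cache */

	struct cpu_info {
		/* ... existing per-CPU members ... */
	#ifdef MULTIPROCESSOR
	#define __HAVE_UVM_PERCPU
		struct uvm_pmr_cache	ci_uvm;	/* [o] page cache */
	#endif
		/* ... */
	};

Without __HAVE_UVM_PERCPU, the #else stubs at the bottom of the
uvm_page.c hunk keep calling uvm_pmr_getone()/uvm_pmr_freepages()
directly, so unconverted architectures behave exactly as before.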