per-CPU page caches for page faults
On 01/04/24 12:15 +1000, David Gwynne wrote:
> 
> 
> > On 1 Apr 2024, at 03:00, Martin Pieuchot <mpi@openbsd.org> wrote:
> > 
> > On 19/03/24(Tue) 15:06, David Gwynne wrote:
> >> On Mon, Mar 18, 2024 at 08:13:43PM +0100, Martin Pieuchot wrote:
> >>> Diff below attaches a 16 page array to the "struct cpu_info" and uses it
> >>> as a cache to reduce contention on the global pmemrange mutex.
> >>> 
> >>> Measured performance improvements are between 7% and 13% with 16 CPUs
> >>> and 19% to 33% with 32 CPUs. -current OpenBSD doesn't scale above 32
> >>> CPUs, so it wouldn't be fair to compare numbers of jobs spread across
> >>> more CPUs. However, as you can see below, this limitation is no longer
> >>> true with this diff.
> >>> 
> >>> kernel
> >>> ------
> >>> 16:  1m47.93s real   11m24.18s user   10m55.78s system
> >>> 32:  2m33.30s real   11m46.08s user   32m32.35s system  (BC cold)
> >>>      2m02.36s real   11m55.12s user   21m40.66s system
> >>> 64:  2m00.72s real   11m59.59s user   25m47.63s system
> >>> 
> >>> libLLVM
> >>> -------
> >>> 16: 30m45.54s real  363m25.35s user  150m34.05s system
> >>> 32: 24m29.88s real  409m49.80s user  311m02.54s system
> >>> 64: 29m22.63s real  404m16.20s user  771m31.26s system
> >>> 80: 30m12.49s real  398m07.01s user  816m01.71s system
> >>> 
> >>> kernel+percpucaches(16)
> >>> ------
> >>> 16:  1m30.17s real   11m19.29s user    6m42.08s system
> >>> 32:  2m02.28s real   11m42.13s user   23m42.64s system  (BC cold)
> >>>      1m22.82s real   11m41.72s user    8m50.12s system
> >>> 64:  1m23.47s real   11m56.99s user    9m42.00s system
> >>> 80:  1m24.63s real   11m44.24s user   10m38.00s system
> >>> 
> >>> libLLVM+percpucaches(16)
> >>> -------
> >>> 16: 28m38.73s real  363m34.69s user   95m45.68s system
> >>> 32: 19m57.71s real  415m17.23s user  174m47.83s system
> >>> 64: 18m59.50s real  450m17.79s user  406m05.42s system
> >>> 80: 19m02.26s real  452m35.11s user  473m09.05s system
> >>> 
> >>> Still the most important impact of this diff is the reduction of %sys
> >>> time. It drops from ~40% with 16 CPUs and ~55% with 32 CPUs or more.
> >>> 
> >>> What is the idea behind this diff? With a significant number of CPUs
> >>> (16 or more), grabbing a global mutex for every page allocation & free
> >>> creates a lot of contention, resulting in many CPU cycles wasted in
> >>> system (kernel) time. The idea of this diff is to add another layer on
> >>> top of the global allocator to allocate and free pages in batches.
> >>> Note that, in this diff, this cache is only used for page faults.
> >>> 
> >>> The number 16 has been chosen after careful testing on an 80 CPU Ampere
> >>> machine. I tried to keep it as small as possible while making sure that
> >>> multiple parallel page faults on a large number of CPUs do not result in
> >>> contention. I'd argue that "stealing" at most 64k per CPU is acceptable
> >>> on any MP system.
> >>> 
> >>> The diff includes 3 new counters visible in "systat uvm" and "vmstat -s".
> >>> 
> >>> When the page daemon kicks in we drain the cache of the current CPU,
> >>> which is the best we can do without adding too much complexity.
> >>> 
> >>> I only tested amd64 and arm64, that's why there is such a define in
> >>> uvm/uvm_page.c. I'd be happy to hear about tests on other architectures
> >>> and different topologies. You'll need to edit $arch/include/cpu.h and
> >>> modify the define.
> >>> 
> >>> This diff is really interesting because it now allows us to clearly see
> >>> which syscalls are contending a lot. Without surprise it's kbind(2),
> >>> munmap(2) and mprotect(2). It also shows which workloads are VFS-bound.
> >>> That is what the "Buffer-Cache Cold" (BC Cold) numbers represent above.
> >>> With a small number of CPUs we don't see much difference between the two.
> >>> 
> >>> Comments?
> >> 
> >> i like the idea, and i like the improvements.
> >> 
> >> this is basically the same problem that jeff bonwick deals with in
> >> his magazines and vmem paper about the changes he made to the solaris
> >> slab allocator to make it scale on machines with a bunch of cpus.
> >> that's the reference i used when i implemented per cpu caches in
> >> pools, and it's probably worth following here as well. the only
> >> real change i'd want you to make is to introduce the "previously
> >> loaded magazine" to mitigate thrashing as per section 3.1 in the
> >> paper.
> >> 
> >> pretty exciting though.
> > 
> > New version that should address all previous comments:
> > 
> > - Use 2 magazines of 8 pages and imitate the pool_cache code. The
> >   miss/hit ratio can be observed to be 1/8 with "systat uvm".
> > 
> > - Ensure that uvm_pmr_getpages() won't fail with highly fragmented
> >   memory, and do not wake up the pagedaemon if it fails to fully reload
> >   a magazine.
> > 
> > - Use __HAVE_UVM_PERCPU & provide UP versions of cache_get/cache_put().
> > 
> > - Change amap_wipeout() to call uvm_anfree() to fill the cache instead
> >   of bypassing it by calling uvm_pglistfree().
> > 
> > - Include a fix for incorrect decrementing of `uvm.swpgonly' in
> >   uvm_anon_release() (should be committed independently).
> > 
> > I didn't do any measurements with this version, but robert@ said it
> > shaves off 30 minutes compared to the previous one for a chromium build
> > w/ 32 CPUs (from 4.5h down to 4h).
> so a chromium build with your first diff is 4.5h? or a vanilla kernel is
> 4.5h?

That's the difference between the first and the second diff: the chromium
build takes 4.5h with the first diff and 4h with the second.
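To make the mechanism under discussion concrete, here is a minimal,
self-contained C sketch of a magazine-style per-CPU page cache in the spirit
of Bonwick's paper: a current and a previously loaded magazine of 8 entries
each, backed by a single mutex-protected global allocator. It only
illustrates the technique, not the UVM diff itself; every identifier in it
(pcpu_cache, cache_get, global_get, ...) is invented for the example and does
not correspond to the names used in uvm_page.c or pmemrange.

/*
 * Illustrative userland sketch of a magazine-style per-CPU page cache.
 * All identifiers are invented for the example.
 */
#include <pthread.h>
#include <stdio.h>

#define MAGAZINE_SIZE	8			/* pages per magazine */

struct page { int id; };			/* stand-in for struct vm_page */

struct magazine {
	int		 nr;			/* pages currently loaded */
	struct page	*pg[MAGAZINE_SIZE];
};

struct pcpu_cache {
	struct magazine	 curr;			/* magazine we allocate from */
	struct magazine	 prev;			/* previously loaded magazine */
	unsigned long	 hits, misses;
};

/* Stand-in for the global allocator behind its single, contended mutex. */
static pthread_mutex_t global_mtx = PTHREAD_MUTEX_INITIALIZER;
static struct page global_pool[1024];
static size_t global_next;

static struct page *
global_get(void)
{
	struct page *pg = NULL;

	pthread_mutex_lock(&global_mtx);
	if (global_next < 1024)
		pg = &global_pool[global_next++];
	pthread_mutex_unlock(&global_mtx);
	return pg;
}

/*
 * Allocate one page.  If the current magazine is empty but the previously
 * loaded one is full, swap them instead of going back to the global pool.
 */
static struct page *
cache_get(struct pcpu_cache *pc)
{
	struct magazine tmp;
	int i;

	if (pc->curr.nr == 0 && pc->prev.nr == MAGAZINE_SIZE) {
		tmp = pc->curr; pc->curr = pc->prev; pc->prev = tmp;
	}
	if (pc->curr.nr > 0) {
		pc->hits++;
		return pc->curr.pg[--pc->curr.nr];
	}

	/* Miss: reload the current magazine under the global mutex. */
	pc->misses++;
	for (i = 0; i < MAGAZINE_SIZE; i++) {
		pc->curr.pg[i] = global_get();
		if (pc->curr.pg[i] == NULL)
			break;
	}
	pc->curr.nr = i;
	return (i > 0) ? pc->curr.pg[--pc->curr.nr] : NULL;
}

/*
 * Free one page.  A full current magazine is swapped with an empty previous
 * one; if both are full, a real implementation would hand a magazine back
 * to the global allocator (or drain it when the page daemon runs).
 */
static void
cache_put(struct pcpu_cache *pc, struct page *pg)
{
	struct magazine tmp;

	if (pc->curr.nr == MAGAZINE_SIZE && pc->prev.nr == 0) {
		tmp = pc->curr; pc->curr = pc->prev; pc->prev = tmp;
	}
	if (pc->curr.nr < MAGAZINE_SIZE)
		pc->curr.pg[pc->curr.nr++] = pg;
}

int
main(void)
{
	static struct pcpu_cache pc;	/* one of these would hang off each CPU */
	struct page *pg = cache_get(&pc);	/* first call misses and reloads */

	cache_put(&pc, pg);
	printf("hits %lu misses %lu\n", pc.hits, pc.misses);
	return 0;
}

The swap of an empty current magazine with a full previous one on the
allocation path (and the mirror-image swap on the free path) is what
mitigates the thrashing dlg refers to in section 3.1 of the paper: a workload
whose allocation pattern straddles a magazine boundary no longer has to take
the global mutex on every other operation.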