From: Robert Nagy <robert@openbsd.org>
Subject: Re: per-CPU page caches for page faults
To: David Gwynne <david@gwynne.id.au>
Cc: Martin Pieuchot <mpi@openbsd.org>, Openbsd Tech <tech@openbsd.org>
Date: Mon, 1 Apr 2024 09:36:40 +0200

On 01/04/24 12:15 +1000, David Gwynne wrote:
> 
> 
> > On 1 Apr 2024, at 03:00, Martin Pieuchot <mpi@openbsd.org> wrote:
> > 
> > On 19/03/24(Tue) 15:06, David Gwynne wrote:
> >> On Mon, Mar 18, 2024 at 08:13:43PM +0100, Martin Pieuchot wrote:
> >>> Diff below attaches a 16 page array to the "struct cpuinfo" and uses it
> >>> as a cache to reduce contention on the global pmemrange mutex.
> >>> 
> >>> Measured performance improvements are between 7% and 13% with 16 CPUs
> >>> and between 19% and 33% with 32 CPUs.  -current OpenBSD doesn't scale
> >>> above 32 CPUs, so it wouldn't be fair to compare jobs spread across
> >>> more CPUs.  However, as you can see below, this limitation no longer
> >>> applies with this diff.
> >>> 
> >>> kernel
> >>> ------
> >>> 16:     1m47.93s real    11m24.18s user    10m55.78s system
> >>> 32:     2m33.30s real    11m46.08s user    32m32.35s system (BC cold)
> >>>        2m02.36s real    11m55.12s user    21m40.66s system
> >>> 64:     2m00.72s real    11m59.59s user    25m47.63s system
> >>> 
> >>> libLLVM
> >>> -------
> >>> 16:     30m45.54s real   363m25.35s user   150m34.05s system
> >>> 32:     24m29.88s real   409m49.80s user   311m02.54s system
> >>> 64:     29m22.63s real   404m16.20s user   771m31.26s system
> >>> 80:     30m12.49s real   398m07.01s user   816m01.71s system
> >>> 
> >>> kernel+percpucaches(16)
> >>> -----------------------
> >>> 16:     1m30.17s real    11m19.29s user     6m42.08s system
> >>> 32:     2m02.28s real    11m42.13s user    23m42.64s system (BC cold)
> >>>        1m22.82s real    11m41.72s user     8m50.12s system
> >>> 64:     1m23.47s real    11m56.99s user     9m42.00s system
> >>> 80:     1m24.63s real    11m44.24s user    10m38.00s system
> >>> 
> >>> libLLVM+percpucaches(16)
> >>> ------------------------
> >>> 16:     28m38.73s real   363m34.69s user    95m45.68s system
> >>> 32:     19m57.71s real   415m17.23s user   174m47.83s system
> >>> 64:     18m59.50s real   450m17.79s user   406m05.42s system
> >>> 80:     19m02.26s real   452m35.11s user   473m09.05s system
> >>> 
> >>> Still, the most important impact of this diff is the reduction of %sys
> >>> time: it drops from ~40% with 16 CPUs and from ~55% with 32 CPUs or more.
> >>> 
> >>> What is the idea behind this diff?  With a significant number of CPUs (16
> >>> or more), grabbing a global mutex for every page allocation & free creates
> >>> a lot of contention, resulting in many CPU cycles wasted in system (kernel)
> >>> time.  The idea of this diff is to add another layer on top of the global
> >>> allocator that allocates and frees pages in batches.  Note that, in this
> >>> diff, this cache is only used for page faults.
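> >>>
> >>> To make the fast path concrete, here is a rough sketch of the idea
> >>> (simplified, with made-up names; this is not the actual diff):
> >>>
> >>>     #define PCP_NPAGES      16      /* 16 * 4KB = 64KB per CPU */
> >>>
> >>>     struct vm_page;                 /* opaque here */
> >>>
> >>>     struct pcp_cache {
> >>>             struct vm_page  *pages[PCP_NPAGES];     /* per-CPU stash */
> >>>             int              npages;                /* cached pages */
> >>>     };
> >>>
> >>>     /* Batch refill from the global allocator (pmemrange mutex). */
> >>>     int     pcp_refill(struct vm_page **, int);
> >>>
> >>>     struct vm_page *
> >>>     pcp_get(struct pcp_cache *pcp)
> >>>     {
> >>>             if (pcp->npages == 0) {
> >>>                     /* Miss: one trip to the global allocator
> >>>                      * refills the whole array. */
> >>>                     pcp->npages = pcp_refill(pcp->pages, PCP_NPAGES);
> >>>                     if (pcp->npages == 0)
> >>>                             return (NULL);  /* fall back to slow path */
> >>>             }
> >>>             /* Hit: no global lock taken for this fault. */
> >>>             return (pcp->pages[--pcp->npages]);
> >>>     }
> >>>
> >>> Freeing is symmetric: pages go back into the per-CPU array first, and a
> >>> full array is returned to the global allocator in one batch.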
> >>> 
> >>> The number 16 was chosen after careful testing on an 80-CPU Ampere
> >>> machine.  I tried to keep it as small as possible while making sure that
> >>> multiple parallel page faults on a large number of CPUs do not result in
> >>> contention.  I'd argue that "stealing" at most 64k (16 x 4KB pages) per
> >>> CPU is acceptable on any MP system.
> >>> 
> >>> The diff includes 3 new counters visible in "systat uvm" and "vmstat -s".
> >>> 
> >>> When the page daemon kicks in, we drain the cache of the current CPU,
> >>> which is the best we can do without adding too much complexity.
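> >>>
> >>> Roughly (same made-up names as in the sketch above; pcp_put_global()
> >>> stands for the existing global free path):
> >>>
> >>>     void
> >>>     pcp_drain(struct pcp_cache *pcp)
> >>>     {
> >>>             /* Hand everything back to the global allocator so the
> >>>              * page daemon can actually reclaim it. */
> >>>             while (pcp->npages > 0)
> >>>                     pcp_put_global(pcp->pages[--pcp->npages]);
> >>>     }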
> >>> 
> >>> I only tested amd64 and arm64; that's why there is such a define in
> >>> uvm/uvm_page.c.  I'd be happy to hear about tests on other architectures
> >>> and different topologies.  You'll need to edit $arch/include/cpu.h and
> >>> modify the define.
> >>> 
> >>> This diff is really interesting because it now allows us to clearly see
> >>> which syscalls are contending a lot.  Unsurprisingly, these are kbind(2),
> >>> munmap(2) and mprotect(2).  It also shows which workloads are VFS-bound.
> >>> That is what the "Buffer-Cache Cold" (BC Cold) numbers represent above.
> >>> With a small number of CPUs we don't see much difference between the two.
> >>> 
> >>> Comments?
> >> 
> >> i like the idea, and i like the improvements.
> >> 
> >> this is basically the same problem that jeff bonwick deals with in
> >> his magazines and vmem paper about the changes he made to the solaris
> >> slab allocator to make it scale on machines with a bunch of cpus.
> >> that's the reference i used when i implemented per cpu caches in
> >> pools, and it's probably worth following here as well. the only
> >> real change i'd want you to make is to introduce the "previously
> >> loaded magazine" to mitigate thrashing as per section 3.1 in the
> >> paper.
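> >>
> >> the gist of that section, very roughly (made-up names, not the actual
> >> pool code):
> >>
> >>     struct magazine {
> >>             void    *objs[8];
> >>             int      nobjs;
> >>     };
> >>
> >>     struct cpu_cache {
> >>             struct magazine *curr;  /* magazine we allocate from */
> >>             struct magazine *prev;  /* previously loaded magazine */
> >>     };
> >>
> >>     void    *depot_get(void);       /* slow path: global depot */
> >>
> >>     void *
> >>     cache_get(struct cpu_cache *cc)
> >>     {
> >>             struct magazine *m;
> >>
> >>             if (cc->curr->nobjs > 0)
> >>                     return (cc->curr->objs[--cc->curr->nobjs]);
> >>
> >>             /*
> >>              * curr is empty.  if prev still holds objects, swap the
> >>              * two magazines instead of going to the depot.  a workload
> >>              * that bounces back and forth around a magazine boundary
> >>              * then stops taking the global lock on every call.
> >>              */
> >>             if (cc->prev->nobjs > 0) {
> >>                     m = cc->curr;
> >>                     cc->curr = cc->prev;
> >>                     cc->prev = m;
> >>                     return (cc->curr->objs[--cc->curr->nobjs]);
> >>             }
> >>             return (depot_get());
> >>     }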
> >> 
> >> pretty exciting though.
> > 
> > New version that should address all previous comments:
> > 
> > - Use 2 magazines of 8 pages and imitate the pool_cache code.  The
> >  miss/hit ratio can be observed to be 1/8 with "systat uvm".
> > 
> > - Ensure that uvm_pmr_getpages() won't fail with highly fragmented
> >  memory, and do not wake up the pagedaemon if it fails to fully reload a
> >  magazine.
> > 
> > - Use __HAVE_UVM_PERCPU & provide UP versions of cache_get/cache_put()
> >  (see the sketch after this list).
> > 
> > - Change amap_wipeout() to call uvm_anfree() to fill the cache instead of
> >  bypassing it by calling uvm_pglistfree(). 
> > 
> > - Include a fix for incorrect decrementing of `uvm.swpgonly' in
> >  uvm_anon_release() (should be committed independently).
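> >
> > For the non-__HAVE_UVM_PERCPU (UP) case mentioned above, the wrappers
> > basically degenerate to the existing global path.  A rough sketch with
> > made-up names (not the actual diff):
> >
> >     #ifndef __HAVE_UVM_PERCPU
> >     struct vm_page *
> >     pgcache_get(void)
> >     {
> >             /* no per-CPU layer: go straight to the global allocator */
> >             return (pgcache_get_global());
> >     }
> >
> >     void
> >     pgcache_put(struct vm_page *pg)
> >     {
> >             pgcache_put_global(pg);
> >     }
> >     #endif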
> > 
> > I didn't do any measurements with this version, but robert@ said it shaves
> > off 30 minutes compared to the previous one for a chromium build with 32
> > CPUs (from 4.5h down to 4h).
> 
> so a chromium build with your first diff is 4.5h? or a vanilla kernel is 4.5h?

that's the difference between the first and the second diff.