From: Theo Buehler Subject: Re: Fix swapping for MP archs. To: Martin Pieuchot Cc: tech@openbsd.org Date: Sun, 24 Nov 2024 03:03:15 +0100 On Sat, Nov 09, 2024 at 01:23:45PM +0100, Martin Pieuchot wrote: > The current swapping algorithm is broken for MP architectures. However > the bouncing logic, and it sleeping point, in uvm_swap_io() hides it. > > In order to see the degenerative behavior you need: > > - A MP machine with more than 2 CPUs > - No high/low memory split and a lot of memory > - Disable swap encryption (sysctl vm.swapencrypt.enable=0) > > When all these criteria are met the kernel ends up doing ping-pong for > every buffer between the page daemon, the aiodone daemon and the disk > interrupt handler, all currently needing the KERNEL_LOCK(). > > To fix this degenerative behavior, where the page daemon is starving the > aiodone daemon and interrupt handler, we need to account for in-flight > pages. Diff below does that by accounting for pages being currently > written to disk in the "shortage" calculation. > > This diff makes swapping works as expected, no degenerative behavior, no > over-swapping on MP machine with the above mentioned specification. I > believe it finally fixes swapping for 64 bits archs with a ton of memory. > > Note that I deliberately did not use atomic_load_int() for reading > `uvmexp.paging'. I do not want to introduce too many differences with > NetBSD for the moment and only a single thread (the page daemon) is > reading the counter. > > This diff is a requirement for in-place encryption which, obviously, > does not use bouncing. > > While here, release the KERNEL_LOCK() in the aiodone daemon when not > needed. > > ok? This reads fine except for the lines already called out by claudio. I suppose the whitespace changes reduce the diff to netbsd, so I don't mind those. However, I find it really hard to convince myself of the correctness of this line: > + atomic_add_int(&uvmexp.paging, -npages); I have no idea why atomic_add_int() has this strange signature taking an unsigned int for both arguments (a trap that just bit mvs in his sysctl unlocking and that isn't sorted out yet). So, the positive npages is negated, promoted to an unsigned int, added to an int that is also promoted to an unsigned int, we rely on unsigned overflow to work out in our favor and I *think* this all works out to what we want, but it makes my head spin. Like claudio I would prefer using atomic_sub_int() here unless you have a very good reason not to.